Online Matrix Completion and Online Robust PCA


Brian Lois and Namrata Vaswani 


Abstract 

This work studies two interrelated problems: online robust PCA (RPCA) and online low-rank matrix completion
(MC). In recent work by Candès et al., RPCA has been defined as a problem of separating a low-rank matrix (true
data), $L := [\ell_1, \ell_2, \dots, \ell_t, \dots, \ell_{t_{\max}}]$, and a sparse matrix (outliers), $S := [x_1, x_2, \dots, x_t, \dots, x_{t_{\max}}]$, from their sum,
$M := L + S$. Our work uses this definition of RPCA. An important application where both these problems occur is
in video analytics in trying to separate sparse foregrounds (e.g., moving objects) and slowly changing backgrounds.

While there has been a large amount of recent work on both developing and analyzing batch RPCA and batch 
MC algorithms, the online problem is largely open. In this work, we develop a practical modification of our recently 
proposed algorithm to solve both the online RPCA and online MC problems. The main contribution of this work 
is that we obtain correctness results for the proposed algorithms under mild assumptions. The assumptions that
we need are: (a) a good estimate of the initial subspace is available (easy to obtain using a short sequence of
background-only frames in video surveillance); (b) the $\ell_t$'s obey a ‘slow subspace change’ assumption; (c) the basis
vectors for the subspace from which $\ell_t$ is generated are dense (non-sparse); (d) the support of $x_t$ changes by at
least a certain amount at least every so often; and (e) algorithm parameters are appropriately set.


I. Introduction 

Principal Components Analysis (PCA) is a tool that is frequently used for dimension reduction. Given a matrix 
of data D, PCA computes a small number of orthogonal directions, called principal components, that contain most 
of the variability of the data. For relatively noise-free data that lies close to a low-dimensional subspace, PCA is 
easily accomplished via singular value decomposition (SVD). The problem of PCA in the presence of outliers is 
referred to as robust PCA (RPCA). In recent work, Candès et al. [2] posed RPCA as a problem of separating a
low-rank matrix, $L$, and a sparse matrix, $S$, from their sum, $M := L + S$. They proposed a convex program called
principal components’ pursuit (PCP) that provided a provably correct batch solution to this problem under mild
assumptions. PCP solves

$$\min_{L,S} \|L\|_* + \lambda \|S\|_{\text{sum}} \quad \text{subject to} \quad L + S = M,$$

where $\|\cdot\|_*$ is the nuclear norm (sum of singular values), $\|\cdot\|_{\text{sum}}$ is the sum of the absolute values of the entries,
and $\lambda$ is an appropriately chosen scalar. The same program was analyzed in parallel by Chandrasekaran et al. [3]
and later by Hsu et al. [4]. Since these works, there has been a large amount of work on batch approaches for
RPCA and their performance guarantees. 

When RPCA needs to be solved in a recursive fashion for sequentially arriving data vectors it is referred to as 
online (or recursive) RPCA. Online RPCA assumes that a short sequence of outlier-free (sparse component free) 
data vectors is available. An example application where this problem occurs is the problem of separating a video 
sequence into foreground and background layers (video layering) on-the-fly El. Video layering is a key first step 
for automatic video surveillance and many other streaming video analytics tasks. In videos, the foreground usually 
consists of one or more moving persons or objects and hence is a sparse image. The background images usually
change only gradually over time [2], e.g., moving lake waters or moving trees in a forest, and hence are well
modeled as lying in a low-dimensional subspace that is fixed or slowly changing. Also, the changes are global
(dense) [2]. In most video applications, it is valid to assume that an initial short sequence of background-only
frames is available and this can be used to estimate the initial subspace via SVD. 

Often in video applications the sparse foreground $x_t$ is actually the signal of interest, and the background $\ell_t$
is the noise. In this case, the problem can be interpreted as one of recursive sparse recovery in (potentially) large
but structured noise. Our result allows for $\ell_t$ to be large in magnitude as long as it is structured. The structure we
impose is that the $\ell_t$'s lie in a low dimensional subspace that changes slowly over time.

In some other applications, instead of there being outliers, parts of a data vector may be missing entirely. When 
the (unknown) complete data vector is a column of a low-rank matrix, the problem of recovering it is referred to 
as matrix completion (MC). An example is recovering video sequences and tracking their subspace changes in the
presence of easily detectable foreground occlusions. If the occluding object’s intensity is known and is significantly
different from that of the background, its support can be obtained by simple thresholding. The background video
recovery problem then becomes an MC problem. A nuclear norm minimization (NNM) based solution for MC was
introduced in Q and studied in IQ. The convex program here is to minimize the nuclear norm of $\hat{M}$ subject to
$\hat{M}$ and $M$ agreeing on all observed entries. Since then there has been a large amount of work on batch methods
for MC and their correctness results. 


A. Problem Definition 

Consider the online MC problem. Let $T_t$ denote the set of missing entries at time $t$. We observe a vector $m_t \in \mathbb{R}^n$
that satisfies

$$m_t = I_{\bar{T}_t} I_{\bar{T}_t}{}' \ell_t \quad \text{for } t = t_{\text{train}} + 1, t_{\text{train}} + 2, \dots, t_{\max}, \tag{1}$$

with the possibility that $t_{\max}$ can be infinity too. Here $\ell_t$ is such that, for $t$ large enough (quantified in Model 2.2),
the matrix $L_t := [\ell_1, \ell_2, \dots, \ell_t]$ is a low-rank matrix. Notice that by defining $m_t$ as above, we are setting to zero
the entries that are missed (see the notation section, Section I-D).

Consider the online RPCA problem. At time $t$ we observe a vector $m_t \in \mathbb{R}^n$ that satisfies

$$m_t = \ell_t + x_t \quad \text{for } t = t_{\text{train}} + 1, t_{\text{train}} + 2, \dots, t_{\max}. \tag{2}$$

Here $\ell_t$ is as defined above and $x_t$ is the sparse (outlier) vector. We use $T_t$ to denote the support set of $x_t$.

For both problems above, for $t = 1, 2, \dots, t_{\text{train}}$, we are given complete outlier-free measurements $m_t = \ell_t$ so
that it is possible to estimate the initial subspace. For the video surveillance application, this would correspond to
having a short initial sequence of background only images, which can often be obtained. For $t > t_{\text{train}}$, the goal
is to estimate $\ell_t$ (or $\ell_t$ and $x_t$ in case of RPCA) as soon as $m_t$ arrives and to periodically update the estimate of
range($L_t$).

In the rest of the paper, we refer to $T_t$ as the missing/corrupted entries’ set.
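The two observation models can be illustrated with a minimal sketch. It assumes plain Python lists for the vectors; the function names `observe_mc` and `observe_rpca` and the toy values are hypothetical and only serve the illustration.

```python
def observe_mc(ell, missing):
    """Online MC observation: entries indexed by the missing set T_t are set to zero."""
    return [0.0 if i in missing else v for i, v in enumerate(ell)]


def observe_rpca(ell, outlier):
    """Online RPCA observation: m_t = ell_t + x_t, with the sparse x_t given as {index: value}."""
    return [v + outlier.get(i, 0.0) for i, v in enumerate(ell)]


# Toy example with n = 4 and T_t = {1, 3} in both problems.
ell_t = [1.0, -2.0, 0.5, 3.0]
m_mc = observe_mc(ell_t, missing={1, 3})
m_rpca = observe_rpca(ell_t, outlier={1: 10.0, 3: -7.0})
```

In the MC case the set $T_t$ is known to the algorithm, whereas in the RPCA case only `m_rpca` is seen and the support of the outlier has to be estimated.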


B. Related Work 

Some other work that also studies the online MC problem (defined differently from above) includes Q, IH, lO,
[10]. We discuss the connection with the idea from 0 in Section IV. The algorithm from [8], GROUSE, is a first
order stochastic gradient method; a result for its convergence to the local minimum of the cost function it optimizes
is obtained in [10]. The algorithm of [9], PETRELS, is a second order stochastic gradient method. It is shown in
[9] that PETRELS converges to the stationary point of the cost function it optimizes. The advantage of PETRELS
and GROUSE is that they do not need initial subspace knowledge. Another somewhat related work is [11].

Partial results have been provided for ReProCS for online RPCA in our older work ifT^. In other more recent
work, another partial result is obtained for online RPCA defined differently from above. Neither of these is a
correctness result. Both require an assumption that depends on intermediate algorithm estimates. Another somewhat
related work is [14] on online PCA with contaminated data. This does not model the outlier as a sparse vector but
defines anything that is far from the data subspace as an outlier.

Some other works only provide an algorithm without proving any performance results, e.g., [15].

We discuss the most related works in detail in Sec. III-C.


C. Contributions 

In this work we develop and study a practical modification of the Recursive Projected Compressive Sensing 
(ReProCS) algorithm introduced and studied in our earlier work ifT^ for online RPCA. We also develop a special 
case of it that solves the online MC problem. The main contribution of this work is that we obtain a complete 
correctness result for ReProCS-based algorithms for both online MC and online RPCA (or more generally, online 
sparse plus low-rank matrix recovery). Online algorithms are useful because they are causal (needed for applications 
like video surveillance) and, in most cases, are faster and need less storage compared to most batch techniques (we 
should mention here that there is some recent work on faster batch techniques as well, e.g., OH). To the best of 
our knowledge, this work and an earlier conference version of this |[T1 may be among the first correctness results 
for online RPCA. The algorithm studied in lUl is more restrictive. 

Moreover, as we will see, by exploiting temporal dependencies, such as slow subspace change, and initial subspace 
knowledge, our result is able to allow for a more correlated set of missing/corrupted entries than do the various 
results for PCP [2], [3], [4] or NNM [6] (see Sec. III).

Our result uses the overall proof approach introduced in our earlier work ifT^ that provided a partial result
for online RPCA. The most important new insight needed to get a complete result is described in Section IV-C.
Also see Sec. III-C. New proof techniques are needed for this line of work because almost all existing works
only analyze batch algorithms that solve a different problem. Also, as explained later, the standard PCA
procedure cannot be used in the subspace update step and hence results for it are not applicable.

As shown in O, because it exploits initial subspace knowledge and slow subspace change, ReProCS has 
significantly improved recovery performance compared with batch RPCA algorithms - PCP and ITSl - as well 
as with the online algorithm of lITSl for foreground and background extraction in many simulated and real video 
sequences; it is also faster than the batch methods but slower than |[T5l . 


D. Notation 

We use $'$ to denote transpose. The 2-norm of a vector and the induced 2-norm of a matrix are denoted by $\|\cdot\|_2$.
For a set $T$ of integers, $|T|$ denotes its cardinality and $\bar{T}$ denotes its complement set. We use $\emptyset$ to denote the empty
set. For a vector $x$, $x_T$ is a smaller vector containing the entries of $x$ indexed by $T$. Define $I_T$ to be an $n \times |T|$
matrix of those columns of the identity matrix indexed by $T$. For a matrix $A$, define $A_T := A I_T$. For matrices
$P$ and $Q$ where the columns of $Q$ are a subset of the columns of $P$, $P \setminus Q$ refers to the matrix of columns in $P$
and not in $Q$.

For an $n \times n$ Hermitian matrix $H$, $H \overset{\text{EVD}}{=} U \Lambda U'$ denotes an eigenvalue decomposition. That is, $U$ has
orthonormal columns, and $\Lambda$ is a diagonal matrix of size at least $\operatorname{rank}(H) \times \operatorname{rank}(H)$. (If $H$ is rank deficient,
then $\Lambda$ can have any size between $\operatorname{rank}(H)$ and $n$.) For Hermitian matrices $A$ and $B$, the notation $A \preceq B$ means
that $B - A$ is positive semi-definite. We order the eigenvalues of an Hermitian matrix in decreasing order. So
$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$.

For integers a and b, we use the interval notation [a, b] to mean all of the integers between a and b, inclusive, 
and similarly for [a, b) etc. 

Definition 1.1. For a matrix $A$, the restricted isometry constant (RIC) $\delta_s(A)$ is the smallest real number $\delta_s$ such
that

$$(1 - \delta_s)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta_s)\|x\|_2^2$$

for all $s$-sparse vectors $x$ [19]. A vector $x$ is $s$-sparse if it has $s$ or fewer non-zero entries.

Definition 1.2. We refer to a matrix with orthonormal columns as a basis matrix. Notice that if $P$ is a basis matrix,
then $P'P = I$.

Definition 1.3. For basis matrices $\hat{P}$ and $P$, define $\operatorname{dif}(\hat{P}, P) := \|(I - \hat{P}\hat{P}')P\|_2$. This quantifies the difference
between their range spaces. If $\hat{P}$ and $P$ have the same number of columns, then $\operatorname{dif}(\hat{P}, P) = \operatorname{dif}(P, \hat{P})$, otherwise
the function is not necessarily symmetric.
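As a quick illustration of Definition 1.3, the following sketch evaluates the subspace difference numerically with NumPy. The function name `dif` mirrors the definition, while the small test matrices are assumptions chosen only to make the asymmetry visible.

```python
import numpy as np


def dif(P_hat, P):
    """Spectral norm of (I - P_hat P_hat') P for basis matrices P_hat and P."""
    residual = P - P_hat @ (P_hat.T @ P)  # part of range(P) not captured by range(P_hat)
    return np.linalg.norm(residual, 2)    # induced 2-norm (largest singular value)


I4 = np.eye(4)
P = I4[:, [0, 1]]        # basis for span{e1, e2}
P_same = I4[:, [1, 0]]   # same subspace, columns permuted
P_orth = I4[:, [2, 3]]   # orthogonal subspace
P_small = I4[:, [0]]     # strict sub-subspace with fewer columns

d_same = dif(P_same, P)
d_orth = dif(P_orth, P)
d_small_vs_P = dif(P_small, P)   # P has a direction that P_small misses
d_P_vs_small = dif(P, P_small)   # range(P_small) lies inside range(P)
```

The last two values differ, which matches the remark that the function need not be symmetric when the two matrices have different numbers of columns.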


E. Organization 

The remainder of the paper is organized as follows. In Section II we give the model and main result for both
online MC and online RPCA. Next we discuss our main results in Section III. The algorithms for solving both
problems are given and discussed in Section IV. The discussion also explains why the proof of our main result
should go through. Section IV-C within this section describes the key insight needed by the proof and Section IV-D
gives the proof outline. The most general form of our model on the missing entries set, $T_t$, is described in Section
V. A key new lemma for proving our main results is also given in this section. The proof of our main results can be
found in Section VI. Proofs of three long lemmas needed for proving the lemmas leading to the main theorem are
postponed until Section VII. Section VIII shows numerical experiments backing up our claims. We discuss some
extensions in Section IX and give conclusions in the final section.


II. Online Matrix Completion: Assumptions and Main Result 

Before we give our model on $\ell_t$, we need the following definition.

Definition 2.1. Recall that $m_t = \ell_t$ for $t = 1, \dots, t_{\text{train}}$ is the training data. Let $\hat\lambda^-_{\text{train}}$ be the minimum non-zero
eigenvalue of $\frac{1}{t_{\text{train}}} \sum_{t=1}^{t_{\text{train}}} m_t m_t'$. That is,

$$\hat\lambda^-_{\text{train}} := \min_{i:\, \lambda_i > 0} \lambda_i\Big(\frac{1}{t_{\text{train}}} \sum_{t=1}^{t_{\text{train}}} m_t m_t'\Big).$$

Define $\hat{P}_{\text{train}}$ to be the matrix containing the eigenvectors of $\frac{1}{t_{\text{train}}} \sum_{t=1}^{t_{\text{train}}} m_t m_t'$ with eigenvalues larger than or
equal to $\hat\lambda^-_{\text{train}}$ as its columns.

We will use $\hat{P}_{\text{train}}$ as the initial subspace knowledge in the algorithms. We will use $\hat\lambda^-_{\text{train}}$ in our algorithms to
set the eigenvalue threshold to both detect subspace change and estimate the number of newly added directions. We
also use $\hat\lambda^-_{\text{train}}$ to state the most general version of the slow subspace change assumption in Model 2.2. However,
as explained in the footnote in the line below (4), we can get a slightly more restrictive model without using $\hat\lambda^-_{\text{train}}$.
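A minimal NumPy sketch of Definition 2.1 follows. It is not the authors' implementation: the function name `init_subspace`, the relative tolerance `tol` used to decide which eigenvalues count as non-zero in floating point, and the synthetic training data are all assumptions made for illustration.

```python
import numpy as np


def init_subspace(M_train, tol=1e-10):
    """Return (P_train, lam_train) from an n x t_train matrix of outlier-free training data."""
    t_train = M_train.shape[1]
    cov = (M_train @ M_train.T) / t_train      # (1/t_train) * sum_t m_t m_t'
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
    keep = eigvals > tol * eigvals.max()       # numerical stand-in for "non-zero" (assumption)
    lam_train = eigvals[keep].min()            # minimum non-zero eigenvalue
    P_train = eigvecs[:, keep][:, ::-1]        # eigenvectors, largest eigenvalue first
    return P_train, lam_train


# Synthetic training data lying exactly in an r0-dimensional subspace of R^n.
rng = np.random.default_rng(0)
n, r0, t_train = 20, 3, 50
P0, _ = np.linalg.qr(rng.standard_normal((n, r0)))    # true basis matrix
M_train = P0 @ rng.uniform(-1.0, 1.0, size=(r0, t_train))

P_train, lam_train = init_subspace(M_train)
span_error = np.linalg.norm(P0 - P_train @ (P_train.T @ P0), 2)
```

With noise-free training data and at least `r0` linearly independent training vectors, `span_error` is zero up to rounding, so the estimated range coincides with the true initial subspace.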
Fig. 1: A diagram of Model 2.2 (timeline of subspace changes: $P_t = P_0$ initially, then $P_t = [P_{t_j - 1}\ P_{t_j,\text{new}}]$ on each interval starting at a change time $t_j$).


A. Model on $\ell_t$

We assume that $\ell_t$ is a vector from a fixed or slowly changing low-dimensional subspace that changes in such
a way that the matrix $L_t := [\ell_1, \ell_2, \dots, \ell_t]$ is low rank for $t$ large enough. This can be modeled in various ways.
The simplest and most commonly used model for data that lies in a low-dimensional subspace is to assume that at
all times, $\ell_t$ is independent and identically distributed (iid) with zero mean and a fixed covariance matrix $\Sigma$ that
is low rank. However this is often impractical since, in most applications, data statistics change with time, albeit
“slowly”. To model this perfectly, one would need to assume that $\ell_t$ is zero mean with covariance $\Sigma_t$ that changes
at each time. Let $\Sigma_t = P_t \Lambda_t P_t'$ denote its diagonalization (with $P_t$ tall); then this means that both $P_t$ and $\Lambda_t$
can change at each time $t$. This is the most general case but it has an identifiability problem for estimating
the subspace of $\ell_t$. The subspace spanned by the columns of $P_t$ cannot be estimated with one data point. If $P_t$
has $r_t$ columns, one needs $r_t$ or more data points for its accurate estimation. So, if $P_t$ changes at each time, it is
not clear how all the subspaces can be accurately estimated. Moreover, in general (without specific assumptions),
this will not ensure that the matrix $L_t$ is low rank. To resolve this issue, a general enough but tractable option
is to assume that $P_t$ is constant for a certain period of time and then changes, while $\Lambda_t$ can change at each time.
To ensure that $\Sigma_t$ changes “slowly”, we assume that, when $P_t$ changes, the eigenvalues along the newly added
subspace directions are small for some time ($d$ frames) after the change. One precise model for this is specified
next. We also assume boundedness of $\ell_t$. This is more practically valid than the usual Gaussian assumption
(often made for simplicity) since most sensor data or noise is bounded.

Model 2.2 (Model on $\ell_t$). Assume that the $\ell_t$ are zero mean and bounded random vectors in $\mathbb{R}^n$ that are mutually
independent over time. Also assume that their covariance matrix $\Sigma_t$ has an eigenvalue decomposition

$$E[\ell_t \ell_t'] = \Sigma_t \overset{\text{EVD}}{=} P_t \Lambda_t P_t'$$

where $P_t$ changes as

$$P_t = \begin{cases} [P_{t-1}\ \ P_{t,\text{new}}] & \text{if } t = t_1 \text{ or } t_2 \text{ or } \dots\ t_J \\ P_{t-1} & \text{otherwise} \end{cases} \tag{3}$$

and $\Lambda_t$ changes as follows. For $t \in [t_j, t_{j+1})$, define $\Lambda_{t,\text{new}} := P_{t_j,\text{new}}{}' \Sigma_t P_{t_j,\text{new}}$ and assume that

$$(\Lambda_{t,\text{new}})_{i,i} = (v_i)^{t - t_j} q_i \hat\lambda^-_{\text{train}}, \quad i = 1, \dots, r_{j,\text{new}} \tag{4}$$

where $q_i \ge 1$ and $v_i > 1$ but not too large.¹ We assume that (a) $t_{j+1} - t_j \ge d$ for a $d \ge (K + 2)\alpha$; and (b) for
all $i$, $q_i (v_i)^d \le 3$. Here $K$ and $\alpha$ are algorithm parameters that are set in Theorem 2.7.

Other minor assumptions are as follows. (i) Define $t_0 := 1$ and assume that $t_{\text{train}} \in [t_0, t_1)$. (ii) For $j =
0, 1, 2, \dots, J$, define

$$r_j := \operatorname{rank}(P_{t_j}) \quad \text{and} \quad r_{j,\text{new}} := \operatorname{rank}(P_{t_j,\text{new}})$$

and assume that $r_j < \min(n, t_{j+1} - t_j)$. This ensures that, for all $t > r_j$, the matrix $L_t$ is low-rank. (iii) Define

$$\lambda^+ := \sup_t \lambda_{\max}(\Lambda_t)$$

as the maximum eigenvalue at any time and assume that $\lambda^+ < \infty$.

Observe from the above that $P_t$ is a basis matrix and $\Lambda_t$ is diagonal. We refer to the $t_j$’s as the subspace change
times.

¹Our result would still hold if the $v_i$ were different for each change time (i.e. $v_{j,i}$). We let them be the same to reduce notation. If we do
not want to use $\hat\lambda^-_{\text{train}}$ here in the model on $\ell_t$, we can replace (4) by $(\Lambda_{t,\text{new}})_{i,i} = (v_i)^{t - t_j} q_i \lambda_{\text{bnd}}$ (for a positive constant $\lambda_{\text{bnd}}$) instead
and assume in the theorem that $\hat\lambda^-_{\text{train}}$ is close to $\lambda_{\text{bnd}}$, e.g. $0.9\lambda_{\text{bnd}} \le \hat\lambda^-_{\text{train}} \le 1.1\lambda_{\text{bnd}}$ will suffice.
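The slow-increase requirement on the variances along the new directions can be checked numerically. The sketch below assumes the geometric form in which the variance at time $t$ equals $v^{t - t_j}\, q$ times the training eigenvalue, with $q \ge 1$ and $q v^d \le 3$; the function name and all numerical values are hypothetical.

```python
def new_direction_variance(t, t_j, v, q, lam_train):
    """Variance along one new direction at time t >= t_j: v**(t - t_j) * q * lam_train."""
    return (v ** (t - t_j)) * q * lam_train


lam_train = 1.0   # stands in for the minimum non-zero training eigenvalue
q = 1.0           # initial scale factor, assumed q >= 1
d = 200           # number of frames over which the variance must stay small
t_j = 1000        # a subspace change time

# Largest growth rate permitted by q * v**d <= 3.
v = (3.0 / q) ** (1.0 / d)

variances = [new_direction_variance(t, t_j, v, q, lam_train) for t in range(t_j, t_j + d + 1)]
lam_new_minus = min(variances)   # smallest variance in the first d frames
lam_new_plus = max(variances)    # largest variance in the first d frames
per_frame_growth = v - 1.0       # relative change from one frame to the next
```

For large `d` the per-frame growth `v - 1` is tiny (about half a percent here), yet the variance stays between the training eigenvalue and three times that value over the whole window, which is the sense in which the change is slow.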


A visual depiction of the above model can be found in Figure 1.

Define the largest and smallest eigenvalues along the new directions for the first $d$ frames after a subspace change
as

$$\lambda^+_{\text{new}} := \max_j \max_{t \in [t_j, t_j + d]} \lambda_{\max}(\Lambda_{t,\text{new}}) \quad \text{and} \quad \lambda^-_{\text{new}} := \min_j \min_{t \in [t_j, t_j + d]} \lambda_{\min}(\Lambda_{t,\text{new}}).$$

The slow change model on $\Lambda_{t,\text{new}}$ is one way to ensure that

$$\hat\lambda^-_{\text{train}} \le \lambda^-_{\text{new}} \le \lambda^+_{\text{new}} \le 3\hat\lambda^-_{\text{train}}, \tag{5}$$

i.e. the maximum variance of the projection of $\ell_t$ along the new directions is small enough for the first $d$ frames
after a change. Also the minimum variance is larger than a constant greater than zero (and hence detectable). The
proof of our main result only relies on (5) and does not use the actual slow increase model in any other way. The
above inequality along with $t_{j+1} - t_j \ge d \ge (K + 2)\alpha$ quantifies “slow subspace change”.

Notice that the above model does not put any assumption on the eigenvalues along the existing directions. In
particular, they do not need to be greater than zero and hence the model automatically allows existing directions
(columns of $P_{t_j - 1}$ for $t \in [t_j, t_{j+1})$) to drop out of the current subspace. It could be the case that for some time
period, $(\Lambda_t)_{i,i} = 0$ (for an $i$ corresponding to a column of $P_{t_j - 1}$), so that the $i$-th column of $P_{t_j - 1}$ is not contributing
anything to $\ell_t$ at that time. For the same index $i$, $(\Lambda_t)_{i,i}$ could also later increase again to a nonzero value. Therefore
$r_j = r_0 + \sum_{i=1}^{j} r_{i,\text{new}}$ is only a bound on the rank of $\Sigma_t$ for $t \in [t_j, t_{j+1})$, and not necessarily the rank itself. A more
explicit model for deletion of directions is to let $P_t$ change as

$$P_t = \begin{cases} [(P_{t-1} \setminus P_{t,\text{del}})\ \ P_{t,\text{new}}] & \text{if } t = t_1 \text{ or } t_2 \text{ or } \dots\ t_J \\ P_{t-1} & \text{otherwise} \end{cases} \tag{6}$$

where $P_{t,\text{del}}$ contains the columns of $P_{t-1}$ for which the variance is zero. If we add the assumption that
$[P_{t_1 - 1}\ P_{t_1,\text{new}}\ P_{t_2,\text{new}}\ \dots\ P_{t_J,\text{new}}]$ be a basis matrix (i.e. deleted directions cannot be part of a later $P_{t_j,\text{new}}$),
then this is a special case of Model 2.2 above. We say special case because this only allows deletions at times $t_j$,
whereas Model 2.2 allows deletion of old directions at any time.

The above model assumes that the $\ell_t$’s are zero mean and mutually independent over time. In the video analytics
application, zero mean is easy to ensure by letting $\ell_t$ be the background image at time $t$ with an empirical ‘mean
background image’ (computed using the training data) subtracted out. The independence assumption then models
independent background variations around a common mean. As we explain later, this can be easily relaxed
and we can get a result very similar to the current one under a first order autoregressive model on the $\ell_t$’s.

For $t \in [t_j, t_{j+1})$, let $P_{t,*} := P_{t_j - 1}$ and $\Lambda_{t,*} := P_{t,*}{}' \Sigma_t P_{t,*}$. Observe that Model 2.2 does not have any constraint
on $\Lambda_{t,*}$. Thus if we assume that its entries are such that their changes from $t$ to $t + 1$ are smaller than or equal to
$\|\Lambda_{t,\text{new}} - \Lambda_{t+1,\text{new}}\|_2$, then clearly, $\|\Sigma_{t+1} - \Sigma_t\|_2 / \|\Sigma_t\|_2 \le \max_i (v_i - 1)$ for all $t \in [t_j, t_j + d]$ and all $j$.² Since $d$ is large,
the upper bound is a small quantity, i.e. the covariance matrix changes slowly. For later time instants, we do not
have any requirement (and so in particular $\Sigma_t$ could still change slowly). Hence the above model includes “slow
changing” and low-rank $\Sigma_t$ as a special case.

²This follows because $\|\Sigma_t\|_2 \ge \|\Lambda_{t,\text{new}}\|_2 = \max_i (v_i)^{t - t_j} q_i \hat\lambda^-_{\text{train}}$ and $\|\Sigma_{t+1} - \Sigma_t\|_2 \le \|\Lambda_{t+1,\text{new}} - \Lambda_{t,\text{new}}\|_2 \le
\max_i (v_i)^{t - t_j} q_i \hat\lambda^-_{\text{train}} \max_i (v_i - 1)$. Thus the ratio is bounded by $\max_i (v_i - 1) \le \max_i (3/q_i)^{1/d} - 1 \le 3^{1/d} - 1$
since $q_i \ge 1$.

Fig. 2: Examples of Model 2.3. (a) $\rho = 3$ and $\beta = 5$ case: a 1D video object of length $s$ that moves by at least $s/3$ pixels once
every 5 frames. (b) $\rho = 1$ and $\beta = 1$ case: the object moves by $s$ at every frame. (b) is an example of the best case for our result
- the case with smallest $\rho$, $\beta$ ($T_t$’s mutually disjoint).


B. Model on the set of missing entries or the outlier support set, Tt 

Our result requires that the set of missing entries (or the outlier support sets), $T_t$, have some changes over time.
We give one simple model for it below. One example that satisfies this model is a video application consisting of
a foreground with one object of length $s$ or less that can remain static for at most $\beta$ frames at a time. When it
moves, it moves downwards (or upwards, but always in one direction) by at least $s/\rho$ pixels, and at most $s/\rho_2$
pixels. Once it reaches the bottom of the scene, it disappears. The maximum motion is such that, if the object were
to move at each frame, it still does not go from the top to the bottom of the scene in a time interval of length $\alpha$,
i.e. $\frac{s}{\rho_2}\alpha \le n$. Anytime after it has disappeared another object could appear. We show this example in Fig. 2.

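Model 2.3 (model on $T_t$). Let $t^k$, with $t^k < t^{k+1}$, denote the times at which $T_t$ changes and let $T^{[k]}$ denote the
distinct sets. For an integer $\alpha$ (we set its value in Theorem 2.7), assume the following.

1) Assume that $T_t = T^{[k]}$ for all times $t \in [t^k, t^{k+1})$ with $(t^{k+1} - t^k) \le \beta$ and $|T^{[k]}| \le s$.

2) Let $\rho$ be a positive integer so that for any $k$,

$$T^{[k]} \cap T^{[k+\rho]} = \emptyset;$$

assume that

$$\rho^2 \beta \le 0.01\alpha.$$

3) For any $k$,

$$\sum_{i=k+1}^{k+\alpha} \left|T^{[i]} \setminus T^{[i+1]}\right| \le n$$

and for any $k < i \le k + \alpha$,

$$(T^{[k]} \setminus T^{[k+1]}) \cap (T^{[i]} \setminus T^{[i+1]}) = \emptyset.$$

(One way to ensure $\sum_{i=k+1}^{k+\alpha} |T^{[i]} \setminus T^{[i+1]}| \le n$ is to require that for all $i$, $|T^{[i]} \setminus T^{[i+1]}| \le \frac{s}{\rho_2}$ with $\frac{s}{\rho_2}\alpha \le n$.)

In this model, $k$ takes values $1, 2, \dots$; the largest value it can take is $t_{\max}$ (this will happen if $T_t$ changes at every
time).

Clearly the video moving object example satisfies the above model as long as $\rho^2 \beta \le 0.01\alpha$. This becomes
clearer from Fig. 2.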
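To make the three conditions of the support-change model concrete, here is a hedged sketch of a checker for a finite list of distinct support sets. It reads the conditions as: each set persists for at most $\beta$ frames and has at most $s$ entries; sets that are $\rho$ changes apart are disjoint with $\rho^2\beta \le 0.01\alpha$; and the entries vacated at successive changes are mutually disjoint within any window of $\alpha$ changes, with total count at most $n$. The function name, argument layout and toy object trajectory are assumptions.

```python
def satisfies_support_model(supports, durations, s, beta, rho, alpha, n):
    """supports[k] is the k-th distinct set T^[k]; durations[k] is how many frames it persists."""
    num = len(supports)

    # Condition 1: bounded duration and bounded support size.
    cond1 = all(dur <= beta for dur in durations) and all(len(T) <= s for T in supports)

    # Condition 2: supports rho changes apart are disjoint, and rho^2 * beta <= 0.01 * alpha.
    cond2 = all(not (supports[k] & supports[k + rho]) for k in range(num - rho))
    cond2 = cond2 and (rho ** 2) * beta <= 0.01 * alpha

    # Condition 3: entries vacated at each change, T^[i] minus T^[i+1].
    freed = [supports[k] - supports[k + 1] for k in range(num - 1)]
    cond3 = True
    for k in range(len(freed)):
        window = freed[k + 1 : k + alpha + 1]     # indices i with k < i <= k + alpha
        if sum(len(F) for F in window) > n:
            cond3 = False
        if any(freed[k] & F for F in window):
            cond3 = False
    return cond1 and cond2 and cond3


# 1D object of length s = 3 that moves down by one pixel (s / rho with rho = 3) every 5 frames.
s, rho, beta, alpha, n = 3, 3, 5, 5000, 10000
moving = [set(range(k, k + s)) for k in range(20)]
ok_moving = satisfies_support_model(moving, [5] * 20, s, beta, rho, alpha, n)

# An object that oscillates between two positions revisits the same pixels and violates the model.
oscillating = [set(range(k % 2, k % 2 + s)) for k in range(20)]
ok_oscillating = satisfies_support_model(oscillating, [5] * 20, s, beta, rho, alpha, n)
```

The oscillating object fails because sets that are $\rho$ changes apart overlap, so some background pixels are never revealed often enough.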


C. Denseness 


In order to recover the $\ell_t$’s from missing data or to separate them from the sparse outliers, the basis vectors for
the subspace from which they are generated cannot be sparse. We quantify this using the incoherence condition
from im. Let $\mu$ be the smallest real number such that

$$\max_i \|P_{t_0}{}' I_i\|_2^2 \le \frac{\mu r_0}{n} \quad \text{and} \quad \max_i \|P_{t_j,\text{new}}{}' I_i\|_2^2 \le \frac{\mu r_{j,\text{new}}}{n} \quad \text{for all } j. \tag{7}$$

Recall from the notation section that $I_i$ is the $i$-th column of the identity matrix (or the $i$-th standard basis vector). We
bound $\mu r_0$ and $\mu r_{j,\text{new}}$ in the theorem.
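The denseness coefficient of a single basis matrix can be computed directly from its row norms. The sketch below is an illustration under the assumption that the coefficient for one $n \times r$ basis matrix $P$ is the smallest number whose product with $r/n$ upper-bounds every squared row norm of $P$; the function name and example matrices are hypothetical.

```python
import numpy as np


def denseness_coefficient(P):
    """Smallest mu such that every squared row norm of the n x r basis matrix P is <= mu * r / n."""
    n, r = P.shape
    row_energy = np.sum(P ** 2, axis=1)   # squared 2-norm of P' I_i is the energy of row i
    return row_energy.max() * n / r


n = 8
# Sparse basis vectors: two columns of the identity (worst case, mu = n / r).
P_sparse = np.eye(n)[:, :2]
# Dense basis vectors: two orthonormal columns with equal-magnitude entries (best case, mu = 1).
col1 = np.ones(n) / np.sqrt(n)
col2 = np.array([1.0, -1.0] * (n // 2)) / np.sqrt(n)
P_dense = np.column_stack([col1, col2])

mu_sparse = denseness_coefficient(P_sparse)
mu_dense = denseness_coefficient(P_dense)
```

A small coefficient means no single coordinate carries much of the subspace energy, which is what allows recovery when some coordinates are missing or corrupted.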


D. Main Result for Online Matrix Completion 

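Definition 2.4. Recall that $r_{j,\text{new}} := \operatorname{rank}(P_{t_j,\text{new}})$ and $r_j := \operatorname{rank}(P_{t_j})$. Define $r_{\text{new}} := \max_j r_{j,\text{new}}$ and $r :=
r_0 + J r_{\text{new}}$.

Also define $a_t := P_t{}' \ell_t$, and for $t \in [t_j, t_{j+1})$, $a_{t,\text{new}} := P_{t_j,\text{new}}{}' \ell_t$. Let

$$\gamma := \max_t \|a_t\|_\infty \quad \text{and} \quad \gamma_{\text{new}} := \max_j \max_{t \in [t_j, t_j + d]} \|a_{t,\text{new}}\|_\infty.$$

Notice that $\operatorname{rank}(L) = \operatorname{rank}(P_{t_{\max}}) \le r$. Also, $\|a_t\|_2 \le \sqrt{r}\gamma$ and for $t \in [t_j, t_j + d]$, $\|a_{t,\text{new}}\|_2 \le \sqrt{r_{\text{new}}}\gamma_{\text{new}}$.

The following theorem gives a correctness result for Algorithm 1 given and explained in Section IV. The algorithm
has two parameters - $\alpha$ and $K$. The parameter $\alpha$ is the number of consecutive time instants that are used to obtain
an estimate of the new subspace, and $K$ is the total number of times the new subspace is estimated before we get
an accurate enough estimate of it. The algorithm uses $\hat{P}_{\text{train}}$ and $\hat\lambda^-_{\text{train}}$ defined in Definition 2.1 and $m_t$ as inputs.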


Theorem 2.5. Consider Algorithm 1. Assume that $m_t$ satisfies (1). Pick a $\zeta$ that satisfies

$$\zeta \le \min\left\{ \frac{10^{-4}}{r^2},\ \frac{0.03\,\hat\lambda^-_{\text{train}}}{r^2 \lambda^+},\ \frac{1}{r^3\gamma^2},\ \frac{\hat\lambda^-_{\text{train}}}{r^3\gamma^2} \right\}.$$

Suppose that the following hold.

1) $\operatorname{dif}(\hat{P}_{\text{train}}, P_{t_{\text{train}}}) \le r_0\zeta$ (notice from Model 2.2 that $P_{t_{\text{train}}} = P_{t_0} = P_1$);

2) The algorithm parameters are set as:

$$K = \left\lceil \frac{\log(0.16\, r_{\text{new}}\zeta)}{\log(0.83)} \right\rceil \quad \text{and} \quad \alpha = C\left(\log(6(K + 1)J) + 11\log(n)\right)$$

for a constant

$$C \ge C_{\text{add}} := 32 \cdot 100^2\, \frac{\max\{16,\ 1.2(\sqrt{\zeta} + \sqrt{r_{\text{new}}}\gamma_{\text{new}})^4\}}{(r_{\text{new}}\zeta\hat\lambda^-_{\text{train}})^2}; \tag{8}$$

3) (Subspace change) Model 2.2 on $\ell_t$ holds;

4) (Changes in the missing/corrupted sets $T_t$) Model 2.3 on $T_t$ holds or its generalization, Model 5.1 (given in
Section V), holds;

5) (Denseness and bound on $s$, $r_0$, $r_{\text{new}}$) the bounds in (7) hold with $2s(r_0 + J r_{\text{new}})\mu \le 0.09n$ and $2s\, r_{\text{new}}\mu \le
0.0004n$;

Then, with probability at least $1 - n^{-10}$, at all times $t$,

1) $\|\hat\ell_t - \ell_t\|_2 \le 1.2(\sqrt{\zeta} + \sqrt{r_{\text{new}}}\gamma_{\text{new}})$;

2) the subspace error $SE_t := \|(I - \hat{P}_t\hat{P}_t{}')P_t\|_2$ is bounded above by $10^{-2}\sqrt{\zeta}$ for $t \in [t_j + d, t_{j+1})$.

Proof: The proof is given in Sections VI and VII. As shown in Lemma 5.2, Model 2.3 is a special case of
Model 5.1 (Model 5.1 is more general) on $T_t$. Hence we prove the result only using Model 5.1. ∎

Footnote (moving-object example of Section II-B): Let $T_t$ be the support set of the object (set of pixels containing the object). The first condition holds since there is at most one object
of size $s$ or less and the object cannot remain static for more than $\beta$ frames. Since it moves in one direction by at least $s/\rho$ each time it
moves, this means that definitely after it moves $\rho$ times, the supports will be disjoint (second condition). The third condition holds because
it moves in one direction and by at most $s/\rho_2$ with $\frac{s}{\rho_2}\alpha \le n$ (so even if it were to move at each $t$, i.e. if $t^{k+1} = t^k + 1$ for all $k$, the third
condition will hold). Also see Fig. 2.
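The parameter choices in the theorem can be turned into a small calculation. The sketch below assumes that $K$ is the ratio $\log(0.16\, r_{\text{new}}\zeta)/\log(0.83)$ rounded up to an integer (the rounding is an assumption made so that $K$ counts estimation steps) and treats the constant $C$ as a user-supplied input rather than computing its lower bound; function names and numerical values are hypothetical.

```python
import math


def num_subspace_estimates(r_new, zeta):
    """K: smallest integer with 0.83**K <= 0.16 * r_new * zeta (rounding up is an assumption)."""
    return math.ceil(math.log(0.16 * r_new * zeta) / math.log(0.83))


def frames_per_estimate(C, K, J, n):
    """alpha = C * (log(6 * (K + 1) * J) + 11 * log(n)) for a user-supplied constant C."""
    return C * (math.log(6 * (K + 1) * J) + 11 * math.log(n))


def min_gap_between_changes(K, alpha):
    """Slow subspace change needs at least (K + 2) * alpha frames between change times."""
    return (K + 2) * alpha


r_new, zeta = 1, 1e-6
K = num_subspace_estimates(r_new, zeta)
alpha = frames_per_estimate(C=1.0, K=K, J=2, n=100)
d_min = min_gap_between_changes(K, alpha)
```

The factor 0.83 is the per-step contraction of the subspace error that appears in the error bounds, so `K` is the number of contractions needed to bring an initial error of order one down to order $r_{\text{new}}\zeta$.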


Theorem 2.5 says that if an accurate estimate of the initial subspace is available; the two algorithm parameters
are set appropriately; the $\ell_t$’s are mutually independent over time and the low-dimensional subspace from which
$\ell_t$ is generated changes “slowly” enough, i.e. (a) the delay between change times is large enough ($d \ge (K + 2)\alpha$)
and (b) the eigenvalues along the newly added directions are small enough for $d$ frames after a subspace change
(so that (5) holds); the set of missing entries at time $t$, $T_t$, has enough changes; and the basis vectors that span the
low-dimensional subspaces are dense enough; then, with high probability (w.h.p.), the error in estimating $\ell_t$ will
be small at all times $t$. Also, the error in estimating the low-dimensional subspace will be initially large when new
directions are added, but will decay to a small constant times $\sqrt{\zeta}$ within a finite delay.

Consider the accurate initial subspace assumption. If the training data truly satisfies $m_t = \ell_t$ (without any noise
or modeling error) and if we have at least $r_0$ linearly independent $\ell_t$’s (if the $\ell_t$’s are continuous random vectors, this
corresponds to needing $t_{\text{train}} \ge r_0$ almost surely), then the estimate of $\operatorname{range}(P_{t_{\text{train}}})$ obtained from training data will
actually be exact, i.e. we will have $\operatorname{dif}(\hat{P}_{\text{train}}, P_{t_{\text{train}}}) = 0$. The theorem assumption that $\operatorname{dif}(\hat{P}_{\text{train}}, P_{t_{\text{train}}}) \le r_0\zeta$
allows for the initial training data to be noisy or not exactly satisfying the model. If the training data is noisy, we
need to know $r_0$ (in practice this is computed by thresholding to retain a certain percentage of largest eigenvalues).
In this case we can let $\hat\lambda^-_{\text{train}}$ be the $r_0$-th eigenvalue of $\frac{1}{t_{\text{train}}}\sum_{t=1}^{t_{\text{train}}} m_t m_t'$ and $\hat{P}_{\text{train}}$ be the $r_0$ top eigenvectors.

The following corollary is also proved when we prove the above result.

Corollary 2.6. The following conclusions also hold under the assumptions of Theorem 2.7 with probability at least
$1 - n^{-10}$:

1) The estimates of the subspace change times given by Algorithm 1 satisfy $t_j \le \hat{t}_j \le t_j + 2\alpha$, for $j = 1, \dots, J$;

2) The estimates of the number of new directions are correct, i.e. $\hat{r}_{j,\text{new},k} = r_{j,\text{new}}$ for $j = 1, \dots, J$ and $k = 1, \dots, K$;

3) The recovery error satisfies:

$$\|\hat\ell_t - \ell_t\|_2 \le \begin{cases} 1.2(\sqrt{\zeta} + \sqrt{r_{\text{new}}}\gamma_{\text{new}}) & t \in [t_j, \hat{t}_j] \\ 1.2(1.84\sqrt{\zeta} + (0.83)^{k-1}\sqrt{r_{\text{new}}}\gamma_{\text{new}}) & t \in [\hat{t}_j + (k - 1)\alpha, \hat{t}_j + k\alpha - 1],\ k = 1, 2, \dots, K \\ 2.4\sqrt{\zeta} & t \in [\hat{t}_j + K\alpha, t_{j+1} - 1]; \end{cases}$$

4) The subspace error satisfies,

$$SE_t \le \begin{cases} 1 & t \in [t_j, \hat{t}_j] \\ 10^{-2}\sqrt{\zeta} + 0.83^{k-1} & t \in [\hat{t}_j + (k - 1)\alpha, \hat{t}_j + k\alpha - 1],\ k = 1, 2, \dots, K \\ 10^{-2}\sqrt{\zeta} & t \in [\hat{t}_j + K\alpha, t_{j+1} - 1]. \end{cases}$$
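The staircase decay of the subspace error bound after a detected change can be tabulated with a short sketch. It covers only the times from the detection time onward, where the bound is $10^{-2}\sqrt{\zeta} + 0.83^{k-1}$ on the $k$-th block of $\alpha$ frames for $k = 1, \dots, K$ and $10^{-2}\sqrt{\zeta}$ afterwards; the function name and the small parameter values are assumptions for illustration.

```python
import math


def subspace_error_bound(t, t_hat_j, K, alpha, zeta):
    """Upper bound on SE_t for t >= t_hat_j (and before the next subspace change)."""
    if t < t_hat_j:
        raise ValueError("this sketch only covers times from the detection time onward")
    floor_level = 1e-2 * math.sqrt(zeta)
    k = (t - t_hat_j) // alpha + 1        # index of the alpha-frame block containing t
    if k <= K:
        return floor_level + 0.83 ** (k - 1)
    return floor_level


zeta, K, alpha, t_hat_j = 1e-6, 3, 10, 100
bounds = [subspace_error_bound(t, t_hat_j, K, alpha, zeta) for t in range(100, 140)]
```

The bound is constant within each block of `alpha` frames, drops by a factor of roughly 0.83 from one block to the next, and settles at the floor level once `K` blocks have passed.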


E. Main Result for Online Robust PCA 

Recall that in this case we assume that the observations $m_t$ satisfy $m_t = \ell_t + x_t$ with the support of $x_t$, denoted
$T_t$, not known. We have the following result for Algorithm 2 given and explained in Section IV. This requires two
extra assumptions beyond what the previous result needed. For the matrix completion problem, the set of missing
entries is known, while in the robust PCA setting, the support set, $T_t$, of the sparse outliers, $x_t$, must be determined.
We recover this using an ell-1 minimization step followed by thresholding. To do this correctly, we need a lower
bound on the absolute values of the nonzero entries of $x_t$. Moreover, Algorithm 2 has two extra parameters: $\xi$,
which is the bound on the two norm of the noise seen by the ell-1 minimization step, and $\omega$, which is the threshold
used to recover the support of $x_t$. These need to be set appropriately.

Theorem 2.7. Consider Algorithm 2. Assume that $m_t$ satisfies (2) and assume everything else in Theorem 2.5.
Also assume

1) The two extra algorithm parameters are set as: $\xi = \sqrt{r_{\text{new}}}\gamma_{\text{new}} + (\sqrt{r} + \sqrt{r_{\text{new}}})\sqrt{\zeta}$ and $\omega = 7\xi$;

2) We have $x_{\min} := \min_t \min_{i:\,(x_t)_i \ne 0} |(x_t)_i| > 14\xi$.

Then with probability at least $1 - n^{-10}$,

1) all conclusions of Theorem 2.5 and Corollary 2.6 hold;

2) the support set $T_t$ is exactly recovered, i.e. $\hat{T}_t = T_t$ for all $t$;

3) $\|\hat{x}_t - x_t\|_2 = \|\hat\ell_t - \ell_t\|_2$ and $\|\hat\ell_t - \ell_t\|_2$ satisfies the bounds given in Theorem 2.5 and Corollary 2.6.

The second assumption above can be interpreted as either a lower bound on $x_{\min}$ or as an upper bound on
$\sqrt{r_{\text{new}}}\gamma_{\text{new}}$ in terms of $x_{\min}$. This latter interpretation is another “slow subspace change” condition. For the $x_t$’s,
this result shows that their support is exactly recovered w.h.p. and its nonzero entries are accurately recovered.
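The thresholding step that follows the ell-1 minimization can be sketched in a few lines. This is only the support-detection part, not the full algorithm: the function name, the stand-in estimate `x_est` (which here is a hand-made perturbation of the true outlier rather than the output of an ell-1 solver) and the numerical values are all assumptions.

```python
def threshold_support(x_est, omega):
    """Estimated support: indices whose estimated magnitude exceeds the threshold omega."""
    return {i for i, v in enumerate(x_est) if abs(v) > omega}


xi = 0.1                 # assumed bound on the noise seen by the ell-1 step
omega = 7 * xi           # threshold set as a multiple of xi

x_true = [0.0, 2.0, 0.0, -1.8, 0.0]     # nonzero entries are well above omega
x_est = [0.3, 1.6, -0.5, -1.5, 0.1]     # hypothetical ell-1 estimate with entrywise error below omega

true_support = {i for i, v in enumerate(x_true) if v != 0.0}
est_support = threshold_support(x_est, omega)
```

Exact support recovery in this toy case relies on two facts: the estimation error on zero entries stays below `omega`, and the true nonzero entries are large enough that the error cannot push them under `omega`. This is why a lower bound on the nonzero magnitudes is needed.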


F. Simple Generalizations 

Model on $\ell_t$. Consider the subspace change model, Model 2.2. For simplicity we put a slow increase model on
the eigenvalues along the new directions for the entire period $[t_j, t_{j+1})$. However, as explained below the model,
the proof of our result does not actually use this slow increase model. It only uses (5), i.e. $\hat\lambda^-_{\text{train}} \le \lambda^-_{\text{new}} \le \lambda^+_{\text{new}} \le
3\hat\lambda^-_{\text{train}}$. Recall that $\lambda^-_{\text{new}}$ and $\lambda^+_{\text{new}}$ are the minimum and maximum eigenvalues along the new directions for the
first $d$ frames after a subspace change. Thus, in the interval $[t_j + d + 1, t_{j+1})$ our proof actually does not need any
constraint on $\Lambda_{t,\text{new}}$.

With a minor modification to our proof, we can prove our result with an even weaker condition. We need (5) to
hold with $\lambda^-_{\text{new}}$ being the minimum of the minimum eigenvalues of any $\alpha$-frame average covariance matrix along
the new directions over the period $[t_j, t_j + d]$, i.e. with $\lambda^-_{\text{new}} = \min_j \min_{\tau \in [t_j, t_j + d - \alpha]} \lambda_{\min}\big(\frac{1}{\alpha}\sum_{t=\tau}^{\tau + \alpha - 1} \Lambda_{t,\text{new}}\big)$. For
video analytics, this translates to requiring that, after a subspace change, enough (but not necessarily all) background
frames have ‘detectable’ energy along the new directions, so that the minimum eigenvalue of the average covariance
is above a threshold.

Secondly, we should point out that there is a trade off between the bound on qivfi, and consequently on A+g^, in 


Model 2.2 and the bound on p^jd assumed in Model 2.3 Allowing a larger value of qtVi (faster subspace change) 
will require a tighter bound on $\rho^2\beta$, which corresponds to requiring more changes to $\mathcal{T}_t$. We chose the bounds
^ 3 and p^fi < .Ola for simplicity of computations. There are many other pairs that would also work. The 
above trade-off can be seen from the proof of Lemma 6.14. The proof uses Model 5.1, of which Model 2.3 is a special case. For video analytics, this means that if the background subspace changes are faster, then we also need the foreground objects to be moving more so we can ‘see’ enough of the background behind them.

Thirdly, in Model 2.2 we let PtAtPt be an EVD of This automatically implies that A* is diagonal. But our 
proof only uses the fact that A* is block diagonal with blocks A^^* and Aj new If we relax this and we let PtAtPt 
be a decomposition of 5]^ where At is block diagonal as above, then our model allows the variance along any 
direction from range(Pi^_i) to become zero for any period of time and/or become nonzero again later. Thus, in 
the special case of Q we can actually allow Pt = [{Pt-iRt \ Pt, del) -Ft,new]> where Rt is an rj_i x rj^i rotation 
matrix and Pt,dei contains the columns of Pt-iRt for which the variance is zero. This will be a special case of 
this generalization if [Pti-i Pti,ne^ Pt 2 ,new ■ ■ ■ Ptj,new] is a basis matrix. 

Initialization condition. The first condition of the theorem requires that we have accurate initial subspace knowledge. As explained below the theorem, this means that we can allow noisy training data. Moreover, notice that if we let $t_1 = t_{\text{train}} + 1$, then new background directions can enter the subspace at the same time as the first foreground object. Said another way, all we need is an accurate enough estimate of all but $r_{\text{new}}$ directions of the initial subspace, and an assumption of small eigenvalues for some time ($d$ frames) along the directions for which we do not have an accurate enough estimate (or do not have an estimate).

Denseness assumption. Consider the denseness assumption. Define the (un)denseness coefficient as follows.

Definition 2.8. For a basis matrix $P$, define $\kappa_s(P) := \max_{|\mathcal{T}| \le s} \|I_{\mathcal{T}}' P\|_2$.

Notice that the left hand side in the theorem's denseness condition is $[\kappa_1(P)]^2$. Using the triangle inequality, it is easy to show that $\kappa_s(P) \le \sqrt{s}\,\kappa_1(P)$ [12]. Therefore, using the fact that for a basis matrix $[P_1\ P_2]$, $(\kappa_s([P_1\ P_2]))^2 \le (\kappa_s(P_1))^2 + (\kappa_s(P_2))^2$ (see proof of the first statement of Lemma C.2 in Appendix C), the denseness assumptions of Theorem 2.7 imply that

$$\kappa_{s,*} := \max_j \kappa_{2s}(P_{t_j}) \le 0.3 \quad \text{and} \quad \kappa_{s,\text{new}} := \max_j \kappa_{2s}(P_{t_j,\text{new}}) \le 0.02. \qquad (9)$$


The proof of Theorem 2.7 only uses (9) for the denseness assumption.
The reason for defining $\kappa_s$ as above is the following lemma from [12].

Lemma 2.9 ([12]). For a basis matrix $P$, $\delta_s(I - PP') = (\kappa_s(P))^2$.
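The identity in Lemma 2.9 can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: it computes the denseness coefficient and the restricted isometry constant by brute force over all index sets of size $s$, which is feasible only for tiny $n$. The function names `kappa` and `ric_of_projection` are hypothetical and not from the paper.

```python
from itertools import combinations

import numpy as np


def kappa(P, s):
    """Brute-force kappa_s(P) = max over |T| <= s of ||I_T' P||_2.

    Adding rows never lowers the spectral norm, so checking |T| = s suffices.
    """
    n = P.shape[0]
    return max(np.linalg.norm(P[list(T), :], 2)
               for T in combinations(range(n), s))


def ric_of_projection(P, s):
    """Brute-force delta_s(I - P P'): the largest deviation from 1 of the
    squared singular values of the column sub-matrices (I - P P')_T, |T| = s."""
    n = P.shape[0]
    Phi = np.eye(n) - P @ P.T
    worst = 0.0
    for T in combinations(range(n), s):
        sv = np.linalg.svd(Phi[:, list(T)], compute_uv=False)
        worst = max(worst, float(np.max(np.abs(sv**2 - 1.0))))
    return worst


rng = np.random.default_rng(0)
P, _ = np.linalg.qr(rng.standard_normal((8, 2)))  # an 8 x 2 basis matrix
s = 3
k = kappa(P, s)
delta = ric_of_projection(P, s)
print(delta, k**2)  # the two numbers should agree up to rounding
```

The same helper can be used to check the triangle-inequality bound $\kappa_s(P) \le \sqrt{s}\,\kappa_1(P)$ by comparing `kappa(P, s)` with `np.sqrt(s) * kappa(P, 1)`.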


Lower bound on minimum nonzero entry of $x_t$ in the online RPCA result (Theorem 2.7). For online RPCA, notice that our result needs a lower bound on the minimum magnitude nonzero entry, $x_{\min}$, of the outlier vector $x_t$. This may seem counter-intuitive, since it means that outlier magnitudes need to be large enough for the proposed algorithm to work, whereas one would expect that smaller corruptions are easier to deal with. This is actually true in our case as well, and the lower bound on the minimum nonzero entry of $x_t$ is an artifact of trying to use a simpler model and a simpler proof approach. As we explain next, what we really need is that the corruptions either be small enough (to not affect subspace recovery too much) or be large enough (to be detectable).

Corollary 2.10 (No lower bound on outlier magnitude). Consider Algorithm^ Assume that rrit satisfies Q. Assume 
that the following hold: 

1) Suppose that Xt and it are mutually independent; and there exists a partition ofTt into It,Urge, Tt, small that 
• mint miujeT; “ 14maxt ||(®t)rt.,„,iill 2 > 14(^rnew7new(d) + iV^ + ^/rnew)^/C) 


and maxt ||(®t)rt..„aiill 2 ^ 0.03CA 


train 
Let $\epsilon_w := \max_t \|(x_t)_{\mathcal{T}_{t,\text{small}}}\|_2$.
2) Algorithm parameters are set as £, = ew + \/^new 7 new + 


;)VC; w = 76- k = 


a 


> 32 • 1002 max{16,1.2(v^+y^W+e„)n (log(g(X + 1 )^) + H logw); 

■C\, 


log( 0 . 16 rnewC) 

log(0.83) 


; and 


' newS^Hrain I 


3) Everything else in Theorem 2.5 holds with Tt replaced by large- 

Then, with probability at least 1 — n~^^, the support set of the large entries of xt, 7t,large, exactly recovered at 
all times, \\xt — Xt \\2 = \\£t — £t \\2 < 1-2 [s/C + -y/rnew 7 new + ^w) and and all other conclusions of Theorem 
hold. 


2.5 


Proof: The proof will follow in exactly the same fashion as the proof of the original theorem. We will 
just need to treat {xt)Tt extra “noise” term and use one of the following three facts at various places. 

Let E[.] denote expectation conditioned on accurate recovery so far and on Tt (this is formally defined in the 
proofs). We will use (a) \\{xt)rt,„„,.uh < < \/ 0-03C\"ain; (b) = 0 (this follows because 

it is zero mean and it and Xt are independent (and hence it and {T, {xt)T^ are independent)); and (c) 
P[(®t)T,™an(®t)T,=„.n']l|2 < maxt ||(xt)r*.,_n(®t)T..„.u'l|2 < 4 < 0.03C\",^in- ■ 

III. Discussion 


A. Discussion of the assumptions used 


In the previous section, we provide two related results, one for online matrix completion (MC) and the second for online robust PCA (RPCA). The result for online RPCA can also be interpreted as a result for online sparse matrix recovery in (potentially) large but structured noise $\ell_t$. Notice that our result does not require an upper bound on $\lambda^+$ (the maximum eigenvalue of $\mathrm{Cov}(\ell_t)$ at any time) or on $\gamma$ (the bound on the maximum magnitude of any entry of $P_t'\ell_t$ for any time $t$). Both these parameters are only used to select $\zeta$, which in turn governs the value of $K$ and $\alpha$ and hence governs the required delay between subspace change times.

Our results require accurate initial subspace knowledge. As explained earlier, for video analytics, this corresponds 
to requiring an initial short sequence of background-only video frames whose subspace can be estimated via SVD 
(followed by using a singular value threshold to retain a certain number of top left singular vectors). Alternatively 
if an initial short sequence of the video data satisfies the assumptions required by a batch method such as PCP (for 
RPCA) and NNM (for MC), that can be used to estimate the low-rank part, followed by SVD to get the column 
subspace. For online MC, another alternative is to use the initialization techniques of GROUSE IS] or PETRELS 
lj9l or to use the adaptive MC idea of ITTIl . 

In Model 2.2, we are placing a slow increase assumption on the eigenvalues along the new directions, $P_{t_j,\text{new}}$, for the interval $[t_j, t_{j+1})$. Thus after $t_{j+1}$, the eigenvalues along $P_{t_j,\text{new}}$ can increase gradually or suddenly to any large value up to $\lambda^+$. In fact, as explained above, our proof needs the slow increase to hold only for the first $d$ time instants after $t_j$, so at any time after $t_j + d$, the eigenvalues along $P_{t_j,\text{new}}$ could increase to a large value.

Model 2.3 on $\mathcal{T}_t$ is a practical model for moving foreground objects in video. We should point out that this model is one special case of the general set of conditions we need (Model 5.1). Some other special cases of it are discussed in Section IX.

The model on $\mathcal{T}_t$ (Model 2.3) and the denseness condition of the theorem constrain $s$ and $s, r_0, r_{\text{new}}, J$ respectively. Model 2.3 requires $s \le \rho_2 n/\alpha$ for a constant $\rho_2$. Using the expression for $\alpha$, it is easy to see that as long as $J \in O(n)$, we have $\alpha \in O(\log n)$ and so Model 2.3 needs $s \in O(n/\log n)$. With $s \in O(n/\log n)$, the denseness condition will hold if $r_0 \in O(\log n)$, $J \in O(\log n)$ and $r_{\text{new}}$ is a constant. This is one set of sufficient conditions that we allow on the rank-sparsity product.
B. Comparison with the results for PCP and NNM

Let L := [£i,£ 2 , • • •, and S := [xi,X 2 ,..., Let r^at := rank(L). Clearly r^at < ro + Jvnew and 

the bound is tight. Let Smat := ^maxS be a bound on the total number of missing entries of L or on the support size 
of the outliers’ matrix S. In terms of r^at and Smat, what we need is r^at £ C>(logn) and Smat £ ^( 7ogn )• 
is stronger than what the PCP result from ||2l or the result for NNM from @ need (e.g., the PCP result from O 
allows Tmat £ O ^ ^ while allowing Smat £ 0{ntma.x)), but is similar to what the PCP results from l|3l, iffl 

need. 

Other disadvantages of our result are as follows. (1) Our result needs accurate initial subspace knowledge and 
slow subspace change of It. As explained earlier and in ifT^ Fig. 6 ], both of these are often practically valid 
for video analytics applications. Moreover, we also need the Efs to be zero mean and mutually independent over 
time. Zero mean is achieved by letting £t be the background image at time t with an empirical ‘mean background 
image’, computed using the training data, subtracted out. The independence assumption then models independent 
background variations around a common mean. As we explain in Section [I^ this can be easily relaxed and we 
can get a result very similar to the current one under a first order autoregressive model on the £t’s. (2) Moreover, 
Algorithms [T] and need multiple algorithm parameters to be appropriately set. The PCP or NNM results need this 
for none ||2l, || 6 l or at most one Q, IH algorithm parameter. (3) Thirdly, our result for online RPCA also needs a 
lower bound on Xmin while the PCP results do not need this. (4) Moreover, even with this, we can only guarantee 
accurate recovery of £t, while PCP or NNM guarantee exact recovery. 

(1) The advantage of our work is that we analyze an online algorithm (ReProCS) that is faster and needs less 
storage compared with PCP or NNM. It needs to store only a few n x a or n x r^at matrices, thus the storage 
complexity is 0{nlogn) while that for PCP or NNM is 0(nt max ). In general fmax can be much larger than logn. 
(2) Moreover, we do not need any assumption on the right singular vectors of L while all results for PCP or NNM 
do. (3) Most importantly, our results allow highly correlated changes of the set of missing entries (or outliers). 
From the assumption on Ti, it is easy to see that we allow the number of missing entries (or outliers) per row 
of L to be C>(fmax) as long as the sets follow Model 2.2 ^ The PCP results from @, ||4l need this number to be 


which is stronger. The PCP result from lO or the NNM result 0 need an even stronger condition - they 
need the set (U*"“ 7 t) to be generated uniformly at random. 


C. Other results for online RPCA and online MC 

Our online RPCA result improves upon the online RPCA results from our earlier work lIT^ for two reasons. 
First, the result of l[T2ll was a partial result because it required a denseness assumption on (7 — Pt-^newPtj,new')Pt 
and (7 — Pt^^Pt,*' — Pt,newP,new')Ptj,new Here Pi,* and P,new are estimates computed by Algorithm Thus, 
the result depended on intermediate algorithm estimates satisfying certain properties. In this work, we remove this 
requirement and instead provide a complete correctness result. The extra assumption that we need is Model |2.3| on 


7i (or its generalization given in Model |5.1| later). Secondly, we provide a correctness result for a ReProCS-based 
algorithm that detects subspace change automatically and also estimates the rank of the new subspace automatically. 
The algorithm studied in |[T2l required knowing tj and r^ new exactly for each j. Algorithms [T] and in this work 
only require upper bounds on rnew, 7new and J (these are needed to set the algorithm parameters - a and K for 
Algorithm [I] and also ^ and oj for Algorithm and a small enough (need bounds on r, A+ and 7 to set this). 
A third minor advantage is that we also provide an algorithm and a result for online MC. 

‘'in a period of length a, the set 7t can occupy index i for at most p/3 time instants, and this pattern is allowed to repeat every a time 
instants. So an index can be in the support for a total of time instants and the model assumes p/3 < ' 


for a constant p. 
The proof of our results adapts the overall framework developed in [12]. The two important additions are: (a) Model 5.1 and Lemma 5.3 for it, and the way it is used in the proof of Lemma 6.23; and (b) the detection lemma (Lemma 6.17), the no false detection lemma (Lemma 6.16) and the p-PCA lemma (Lemma 6.18), and the lemmas used to prove these. (a) allows us to get a complete correctness result; (b) allows us to analyze an algorithm that does not use knowledge of $t_j$ or $r_{j,\text{new}}$.

In [20], Feng et al. propose a method for online RPCA and prove a partial result for their algorithm. The approach is to reformulate the PCP program and use this reformulation to develop a recursive algorithm that converges asymptotically to the solution of PCP as long as the basis estimate $\hat{P}_t$ is full rank at each time $t$. Since this result assumes something about the algorithm estimates, it is also only a partial result.

Another recent work that uses knowledge of the initial subspace estimate and performs recovery in a piecewise 
batch fashion is modified-PCP ||2T1. However, like PCP, the result for modified PCP also needs uniformly randomly 
generated support sets. Its advantage is that its assumption on the rank-sparsity product is weaker than that of 
PCP, and hence weaker than that needed by this work. A detailed simulation comparison between modified-PCP, 
ReProCS and PCP demonstrating both these things is available in ll^ Fig. 6].

Some other recent works that also study the online MC problem (defined differently from how we define it) include iH, Grassmannian Rank-One Update Subspace Estimation (GROUSE) [8] and Parallel Subspace Estimation and Tracking by Recursive Least Squares From Partial Observations (PETRELS) [9]. We discuss the connection with Q in Section IX. GROUSE is a first order stochastic gradient method. It uses rank-one updates to track the underlying subspace on the Grassmannian manifold. A result for its convergence to the local minimum of the cost function it optimizes is obtained in [10]. PETRELS is a second order stochastic gradient method. As explained in [9], in PETRELS, the low-dimensional subspace is tracked by minimizing a geometrically discounted sum of projection residuals on the observed entries at each time index. If missing entries are required then they can be reconstructed via least squares estimation. The subspace is updated recursively so that it is not necessary to retain historical data indefinitely. If the underlying subspace is fixed and the data stream is fully observed, then it is shown that the PETRELS estimate converges to the true subspace. In general, it always converges to the stationary point of the cost function it optimizes [9]. The advantage of PETRELS and GROUSE is that they do not need initial subspace knowledge. For our algorithms, when the initial subspace knowledge is not available or initial complete and outlier-free data is not available, we can also use the PETRELS or GROUSE ideas for initialization.


IV. Automatic ReProCS Algorithms for Online MC and Online RPCA and Why They Work 
In this section, we first introduce the automatic ReProCS based algorithm for online MC and explain why it 
works (this also provides the key idea why the proof of our main result would go through). Next, we do the same 
thing for the online RPCA algorithm. In the last two subsections (Sections IV-C and |IV-D| ), we explain the key 
insight used by our proof and give the proof outline. 


A. Automatic ReProCS for Online MC (Algorithm 1)

The online MC model on $m_t$ is a special case of the online RPCA model with $x_t = -I_{\mathcal{T}_t} I_{\mathcal{T}_t}' \ell_t$ and with the support of $x_t$, $\mathcal{T}_t$, known. Thus, we can use a simplification of the ReProCS idea for online RPCA [12] to also solve the online MC problem.

Algorithm 1 proceeds as follows. Let $\hat{P}_{t-1}$ denote the basis matrix for the estimate of the subspace where $\ell_{t-1}$ lies. If it is an accurate estimate, because of “slow subspace change”, projecting the measurement $m_t = x_t + \ell_t$ onto its orthogonal complement will nullify most of $\ell_t$. Specifically, we compute $y_t := \Phi_t m_t$ where $\Phi_t := I - \hat{P}_{t-1}\hat{P}_{t-1}'$. Thus, $y_t$ can be rewritten as

$$y_t = \Phi_t x_t + b_t \quad \text{where } b_t := \Phi_t \ell_t$$

and it can be argued that $\|b_t\|_2$ is small. Since the support of $x_t$, $\mathcal{T}_t$, is known, we can simply recover its nonzero entries by least squares (LS) estimation, i.e. we get $\hat{x}_t = I_{\mathcal{T}_t}((\Phi_t)_{\mathcal{T}_t})^{\dagger} y_t$ and then get an estimate of $\ell_t$ as $\hat{\ell}_t = m_t - \hat{x}_t$. The above approach of recovering $\ell_t$ is equivalent to that used by Brand in [7]; there they recover $\ell_t$ as an LS estimate of $PP'\ell_t \approx \ell_t$.

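Let $e_t := \ell_t - \hat{\ell}_t$. With the above, it is easy to see that

$$e_t = I_{\mathcal{T}_t}[(\Phi_t)_{\mathcal{T}_t}'(\Phi_t)_{\mathcal{T}_t}]^{-1}(\Phi_t)_{\mathcal{T}_t}' b_t = I_{\mathcal{T}_t}[(\Phi_t)_{\mathcal{T}_t}'(\Phi_t)_{\mathcal{T}_t}]^{-1} I_{\mathcal{T}_t}' \Phi_t \ell_t.$$

Using the denseness assumption, it can be argued that the RIC of $\Phi_t$ will be small (see Lemma 2.9). Under the theorem’s assumptions, and conditioned on accurate recovery so far, we can bound it by 0.14. Thus, $\|[(\Phi_t)_{\mathcal{T}_t}'(\Phi_t)_{\mathcal{T}_t}]^{-1}\|_2 \le 1/(1 - 0.14) < 1.2$ and so $\|e_t\|_2 \le 1.2\|b_t\|_2$, i.e. it is small too (see Lemma 6.15).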
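The projected least-squares step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `mc_projected_ls` and the toy dimensions are assumptions, and the subspace estimate is taken to be exact so that $b_t = 0$ and the recovery is exact.

```python
import numpy as np


def mc_projected_ls(m, T, P_hat):
    """One projected LS step for online MC (illustrative sketch).

    m     : length-n measurement, with the entries indexed by T zeroed out
    T     : integer array of missing-entry indices (the known support of x_t)
    P_hat : n x r basis matrix, the current subspace estimate
    """
    n = m.shape[0]
    Phi = np.eye(n) - P_hat @ P_hat.T      # projector onto range(P_hat)-perp
    y = Phi @ m                            # y_t = Phi_t x_t + b_t
    # LS estimate of the nonzero entries of x_t on the known support
    x_T = np.linalg.lstsq(Phi[:, T], y, rcond=None)[0]
    x_hat = np.zeros(n)
    x_hat[T] = x_T
    return m - x_hat                       # l_hat_t = m_t - x_hat_t


rng = np.random.default_rng(1)
n, r = 50, 3
P, _ = np.linalg.qr(rng.standard_normal((n, r)))
ell = P @ rng.standard_normal(r)           # true vector in range(P)
T = np.array([4, 17, 23, 31, 42])          # indices of missing entries
m = ell.copy()
m[T] = 0.0                                 # m_t = l_t - I_T I_T' l_t
ell_hat = mc_projected_ls(m, T, P)
print(np.max(np.abs(ell_hat - ell)))
```

When `P_hat` is only an approximate basis, the same code applies; the recovery error is then governed by $\|b_t\|_2$ as in the bound above.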


Projection-PCA (p-PCA). The next step is to use a modification of standard PCA called projection-PCA (p-PCA), to update the subspace estimate. The reason we need p-PCA is this. Let $\sum_t$ denote a sum over an $\alpha$ length time interval. In our problem, the error, $e_t$, in the observation/estimate of $\ell_t$, $\hat{\ell}_t$, is correlated with $\ell_t$. Because of this, the dominant terms in the perturbation seen by standard PCA, $\frac{1}{\alpha}\sum_t \hat{\ell}_t\hat{\ell}_t' - \frac{1}{\alpha}\sum_t \ell_t\ell_t'$, are $\frac{1}{\alpha}\sum_t \ell_t e_t'$ and its transpose. Thus, when the condition number of $\mathrm{Cov}(\ell_t)$ is large, it becomes difficult to argue that the perturbation will be small compared to the smallest eigenvalue of $\mathrm{Cov}(\ell_t)$. With a large perturbation, either the $\sin\theta$ theorem [22] (that bounds the subspace error between the eigenvectors of the true and estimated sample covariance matrices) cannot be applied or it gives a useless bound.

Our proposed approach, projection-PCA (p-PCA), addresses the above issue as follows. At $t = t_j$, let $P_* := P_{t_j - 1}$, $P_{\text{new}} := P_{t_j,\text{new}}$, and suppose that the subspace $\mathrm{range}(P_*)$ has been accurately recovered, i.e. we have $\hat{P}_*$ so that $\mathrm{dif}(\hat{P}_*, P_*) \ll 1$. Then at a time at or after $t_j + \alpha$, if we project the $\alpha$ previous $\hat{\ell}_t$’s perpendicular to $\hat{P}_*$, we will considerably reduce the perturbation seen by the PCA step. We detect subspace change by checking if the maximum singular value of the matrix formed by these projected $\hat{\ell}_t$’s is above a threshold. Denote the time at which change is detected by $\hat{t}_j$. After $\hat{t}_j$ we use SVD on $K$ different sets of $\alpha$ frames of the projected $\hat{\ell}_t$’s to get improved estimates of the new subspace $\mathrm{range}(P_{\text{new}})$ in each iteration. To be precise, we get the $k$-th estimate, $\hat{P}_{\text{new},k}$, as the left singular vectors of $(I - \hat{P}_*\hat{P}_*')[\hat{\ell}_{\hat{t}_j+(k-1)\alpha+1}, \dots, \hat{\ell}_{\hat{t}_j+k\alpha}]$ with singular values above a threshold. After each p-PCA step, we update $\hat{P}_t$ as $\hat{P}_t = [\hat{P}_*\ \hat{P}_{\text{new},k}]$. Finally at time $t = \hat{t}_j + K\alpha$, we update $\hat{P}_*$ as $[\hat{P}_*\ \hat{P}_{\text{new},K}]$.

In the subspace update step, Algorithm 1 toggles between the “detect” phase and the “ppca” phase. It starts in the “detect” phase. When a subspace change is detected, i.e. at $t = \hat{t}_j$, it enters the “ppca” phase. After $K$ iterations of p-PCA, i.e. at $t = \hat{t}_j + K\alpha + 1$, the new subspace has been accurately estimated and at this time it enters the “detect” phase again.
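The detection test and a single p-PCA step described above can be sketched as follows. This is an illustration under stated assumptions: the function names, the $1/\alpha$ normalization of the projected sample covariance, and the threshold value `0.25` are chosen for the toy example and are not taken from the paper.

```python
import numpy as np


def projected_cov_eigs(L_hat, P_star, alpha):
    """Eigen-decomposition of (1/alpha) D D', with D = (I - P_* P_*') L_hat.

    L_hat holds the alpha most recent estimates of l_t as its columns.
    """
    D = L_hat - P_star @ (P_star.T @ L_hat)
    return np.linalg.eigh(D @ D.T / alpha)   # eigenvalues in ascending order


def change_detected(L_hat, P_star, alpha, thresh):
    evals, _ = projected_cov_eigs(L_hat, P_star, alpha)
    return bool(evals[-1] >= thresh)


def ppca_step(L_hat, P_star, alpha, thresh):
    evals, evecs = projected_cov_eigs(L_hat, P_star, alpha)
    return evecs[:, evals >= thresh]         # basis estimate for new directions


rng = np.random.default_rng(2)
n, alpha, thresh = 40, 100, 0.25             # thresh is an illustrative value
Q, _ = np.linalg.qr(rng.standard_normal((n, 4)))
P_star, P_new = Q[:, :3], Q[:, 3:]
old = P_star @ (3.0 * rng.standard_normal((3, alpha)))   # no new direction
new = old + P_new @ rng.standard_normal((1, alpha))      # one new direction
before = change_detected(old, P_star, alpha, thresh)
after = change_detected(new, P_star, alpha, thresh)
P_new_hat = ppca_step(new, P_star, alpha, thresh)
print(before, after, P_new_hat.shape)
```

Projecting before the eigen-decomposition is the point of the construction: the large-variance directions in $\mathrm{range}(P_*)$ are removed first, so only the energy along the new directions is compared against the threshold.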

Why p-PCA works. The reason p-PCA works is as follows. Before the first p-PCA step, i.e. for $t \in [t_j, \hat{t}_j + \alpha)$, $\hat{P}_t = \hat{P}_*$ and thus the noise seen by the projected sparse recovery step, $b_t = \Phi_t\ell_t = (I - \hat{P}_*\hat{P}_*')\ell_t$, will be the largest. Hence the error $e_t$ will also be the largest for the $\hat{\ell}_t$’s used for the first p-PCA step. However because of the projection perpendicular to $\hat{P}_*$ and slow subspace change, even this error is not too large. Because of this and because $e_t$ is sparse and supported on $\mathcal{T}_t$ and $\mathcal{T}_t$ follows Model 2.3, we can argue that $\hat{P}_{\text{new},1}$ is a good estimate, i.e. $\mathrm{dif}([\hat{P}_*\ \hat{P}_{\text{new},1}], P_{\text{new}}) \le 0.2 < 1$. After the first p-PCA step, $\hat{P}_t = [\hat{P}_*\ \hat{P}_{\text{new},1}]$ and this will reduce $b_t$ and hence $e_t$ for the $\hat{\ell}_t$’s in the next $\alpha$ frames. This and the sparseness of $e_t$, in turn, will mean that the perturbation seen by the second p-PCA step will be smaller and so $\hat{P}_{\text{new},2}$ will be a more accurate estimate of $\mathrm{range}(P_{\text{new}})$ than $\hat{P}_{\text{new},1}$. This is done $K$ times with $K$ chosen so that $\mathrm{dif}([\hat{P}_*\ \hat{P}_{\text{new},K}], P_{\text{new}}) \le r_{\text{new}}\zeta$. By the theorem assumptions


When $\ell_t$ and $e_t$ are uncorrelated and one of them is zero mean, it can be argued by the law of large numbers that, whp, these two terms will be close to zero and $\frac{1}{\alpha}\sum_t e_t e_t'$ will be the dominant term.
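Algorithm 1 ReProCS for Online MC

Parameters: $\alpha$, $K$. Inputs: $\hat{P}_{t_{\text{train}}}$. Output: $\hat{\ell}_t$, $\hat{P}_t$, $\hat{t}_j$, $\hat{r}_{j,\text{new},k}$.

Let thresh denote the eigenvalue threshold that will be used to detect subspace change.

Set $\hat{P}_{t,*} \leftarrow \hat{P}_{t_{\text{train}}}$, $\hat{P}_{t,\text{new}} \leftarrow [\,.\,]$, $j \leftarrow 0$, phase $\leftarrow$ detect.

For every $t > t_{\text{train}}$, do the following:

• Compute $y_t \leftarrow \Phi_t m_t$ where $\Phi_t \leftarrow I - \hat{P}_{t-1}\hat{P}_{t-1}'$

• Estimate $\ell_t$: $\hat{\ell}_t \leftarrow m_t - I_{\mathcal{T}_t}((\Phi_t)_{\mathcal{T}_t})^{\dagger} y_t$

• If $t \bmod \alpha \neq 0$ then $\hat{P}_{t,*} \leftarrow \hat{P}_{t-1,*}$, $\hat{P}_{t,\text{new}} \leftarrow \hat{P}_{t-1,\text{new}}$, $\hat{P}_t \leftarrow [\hat{P}_{t,*}\ \hat{P}_{t,\text{new}}]$

• If $t \bmod \alpha = 0$ then detection or projection PCA
If phase = detect then

1) Set $u = t/\alpha$ and compute $\mathcal{D}_u = (I - \hat{P}_{u\alpha-1,*}\hat{P}_{u\alpha-1,*}')[\hat{\ell}_{(u-1)\alpha+1}, \dots, \hat{\ell}_{u\alpha}]$

2) $\hat{P}_{t,*} \leftarrow \hat{P}_{t-1,*}$, $\hat{P}_{t,\text{new}} \leftarrow \hat{P}_{t-1,\text{new}}$, $\hat{P}_t \leftarrow [\hat{P}_{t,*}\ \hat{P}_{t,\text{new}}]$

3) If $\lambda_{\max}(\frac{1}{\alpha}\mathcal{D}_u\mathcal{D}_u') \ge$ thresh then
phase $\leftarrow$ ppca, $j \leftarrow j + 1$, $k \leftarrow 0$, $\hat{t}_j = t$

Else (phase = ppca) then

1) Set $u = t/\alpha$ and compute $\mathcal{D}_u = (I - \hat{P}_{u\alpha-1,*}\hat{P}_{u\alpha-1,*}')[\hat{\ell}_{(u-1)\alpha+1}, \dots, \hat{\ell}_{u\alpha}]$

2) $\hat{P}_{t,\text{new}} \leftarrow$ eigenvectors$(\frac{1}{\alpha}\mathcal{D}_u\mathcal{D}_u', \text{thresh})$, $\hat{P}_{t,*} \leftarrow \hat{P}_{t-1,*}$, $\hat{P}_t \leftarrow [\hat{P}_{t,*}\ \hat{P}_{t,\text{new}}]$

3) $k \leftarrow k + 1$, set $\hat{r}_{j,\text{new},k} = \mathrm{rank}(\hat{P}_{t,\text{new}})$

4) If $k = K$, then
phase $\leftarrow$ detect, $\hat{P}_{t,*} \leftarrow \hat{P}_t$, $\hat{P}_{t,\text{new}} \leftarrow [\,.\,]$

eigenvectors($\mathcal{M}$, thresh) returns a basis matrix for the span of all eigenvectors whose eigenvalue is above thresh.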


and because we can show $t_j \le \hat{t}_j \le t_j + 2\alpha$ (we explain this below), it is clear that $t_{j+1} > \hat{t}_j + K\alpha$. Thus, the new subspace added at $t_j$ is accurately estimated before the next change time $t_{j+1}$.

Why $\hat{t}_j$ are correctly detected. As explained above, we detect subspace changes by comparing the eigenvalues of $(I - \hat{P}_*\hat{P}_*')(\frac{1}{\alpha}\sum_t \hat{\ell}_t\hat{\ell}_t')(I - \hat{P}_*\hat{P}_*')$ to a chosen threshold at every $t = u\alpha$ for $u = 1, 2, \dots$, when the algorithm is in the “detect” phase. In order to correctly detect $t_j$, the algorithm first must not falsely detect new directions when none are present and it must detect subspace change within a short delay after it has occurred. The former will occur because, conditioned on accurate recovery of the current subspace, $(I - \hat{P}_*\hat{P}_*')(\frac{1}{\alpha}\sum_t \hat{\ell}_t\hat{\ell}_t')(I - \hat{P}_*\hat{P}_*')$ will have very small eigenvalues when no new directions are present. If the recovery were exact and no new directions present, this matrix would be zero. In our case, the recovery is only accurate and so we show that all eigenvalues of this matrix will be below the chosen threshold (see Lemma 6.16). Next consider detection of the subspace change after it has occurred. When $u = u_j := \lceil t_j/\alpha \rceil$, i.e. when $t_j$ is in the interval $[(u - 1)\alpha + 1, u\alpha]$, not all of the $\ell_t$’s in this interval will contain new directions. Thus, depending on where in the interval $t_j$ lies, the algorithm may or may not detect the subspace change. However, in the next interval, $[u_j\alpha + 1, (u_j + 1)\alpha]$, all of the $\ell_t$’s will contain new directions, and we can prove that the subspace change will be detected w.h.p. (see Lemma 6.17). Thus, w.h.p., either $\hat{t}_j = u_j\alpha$, or $\hat{t}_j = (u_j + 1)\alpha$. Thus, we will be able to show that $t_j \le \hat{t}_j \le t_j + 2\alpha$ w.h.p.
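The arithmetic behind the detection delay can be made concrete with a short example. The helper name `candidate_detection_times` is hypothetical; it simply evaluates $u_j = \lceil t_j/\alpha \rceil$ and returns the two possible detection times $u_j\alpha$ and $(u_j + 1)\alpha$ discussed above.

```python
def candidate_detection_times(t_j, alpha):
    """Return (u_j * alpha, (u_j + 1) * alpha) with u_j = ceil(t_j / alpha)."""
    u_j = -(-t_j // alpha)        # integer ceiling, avoids float rounding
    return u_j * alpha, (u_j + 1) * alpha


alpha = 100
for t_j in (201, 250, 300):
    early, late = candidate_detection_times(t_j, alpha)
    # all three change times fall in the interval [201, 300], so u_j = 3
    print(t_j, early, late)
```

For every change time, the earlier candidate is at least $t_j$ and the later one is strictly less than $t_j + 2\alpha$, which is where the bound on the detection delay comes from.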

A visual description of Algorithm [T] is shown in Eig. This figure uses Definition 6.4 


B. Automatic ReProCS for online RPCA (Algorithm 2)

For online RPCA the only difference is that the support for $x_t$, $\mathcal{T}_t$, is not known. Hence we first recover $x_t$ by ell-1 minimization (or any other sparse recovery method) and then estimate its support by thresholding. The rest
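Algorithm 2 ReProCS for Online RPCA

Parameters: $\alpha$, $K$, $\xi$, $\omega$. Inputs: $\hat{P}_{t_{\text{train}}}$. Output: $\hat{\ell}_t$, $\hat{P}_t$, $\hat{t}_j$.

Let thresh denote the eigenvalue threshold. Set $\hat{P}_{t,*} \leftarrow \hat{P}_{t_{\text{train}}}$, $\hat{P}_{t,\text{new}} \leftarrow [\,.\,]$, $j \leftarrow 0$, phase $\leftarrow$ detect.

For every $t > t_{\text{train}}$, do the following:

• Estimate $\mathcal{T}_t$ (the support of the outlier vector $x_t$) and $x_t$.

1) compute $y_t \leftarrow \Phi_t m_t$ where $\Phi_t \leftarrow I - \hat{P}_{t-1}\hat{P}_{t-1}'$

2) solve $\min_x \|x\|_1$ s.t. $\|y_t - \Phi_t x\|_2 \le \xi$, and let $\hat{x}_{t,cs}$ denote its solution

3) compute $\hat{\mathcal{T}}_t = \{i : |(\hat{x}_{t,cs})_i| > \omega\}$

4) LS estimate of $x_t$: compute $\hat{x}_t = I_{\hat{\mathcal{T}}_t}((\Phi_t)_{\hat{\mathcal{T}}_t})^{\dagger} y_t$

• Use all steps of Algorithm 1 with $\mathcal{T}_t$ replaced by $\hat{\mathcal{T}}_t$.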


of the steps remain the same as above. 
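The support-estimation part of Algorithm 2 can be sketched as below. Two assumptions should be noted loudly. First, to keep the example dependency-free, the constrained ell-1 program is replaced by ISTA applied to the Lagrangian (LASSO) form; this is a swapped-in solver, permitted by the text's "any other sparse recovery method", and is not the program stated in the algorithm. Second, the function names, the penalty `lam`, the iteration count, and the threshold `omega=1.0` are illustrative choices for the toy problem only.

```python
import numpy as np


def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def rpca_step(m, P_hat, omega, lam=0.01, n_iter=2000):
    """Projected sparse recovery, support thresholding, then LS (sketch)."""
    n = m.shape[0]
    Phi = np.eye(n) - P_hat @ P_hat.T
    y = Phi @ m
    x_cs = np.zeros(n)
    for _ in range(n_iter):
        # ISTA for 0.5*||y - Phi x||^2 + lam*||x||_1 (assumed stand-in for
        # the constrained ell-1 program); step size 1 is valid as ||Phi||_2 <= 1
        x_cs = soft_threshold(x_cs + Phi.T @ (y - Phi @ x_cs), lam)
    T_hat = np.flatnonzero(np.abs(x_cs) > omega)     # support by thresholding
    x_hat = np.zeros(n)
    if T_hat.size > 0:
        x_hat[T_hat] = np.linalg.lstsq(Phi[:, T_hat], y, rcond=None)[0]
    return x_hat, m - x_hat, T_hat


rng = np.random.default_rng(4)
n, r = 60, 3
P, _ = np.linalg.qr(rng.standard_normal((n, r)))
ell = P @ rng.standard_normal(r)                     # low-rank part
x = np.zeros(n)
support = np.array([5, 20, 33, 47])
x[support] = [5.0, -5.0, 5.0, -5.0]                  # large sparse outliers
x_hat, ell_hat, T_hat = rpca_step(x + ell, P, omega=1.0)
print(T_hat)
```

The final LS step removes the shrinkage bias of the sparse recovery step, which is why only the support of `x_cs`, not its values, is carried forward.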


C. Key Insight for the Proof 


The argument given while explaining why p-PCA works in Section IV can be formalized to show that, w.h.p., a subspace change is detected only after a change has occurred and within $2\alpha$ frames of the change; and that the subspace recovery error, $\mathrm{SE}_t$, will decay roughly exponentially with each p-PCA iteration and become small enough after $K$ iterations. To do this we will use the $\sin\theta$ theorem [22] (Lemma 6.20) followed by the matrix Hoeffding inequality (Lemmas 7.5, 7.6) to get high probability bounds on each of the terms in the subspace error bound obtained by the $\sin\theta$ theorem.

While applying the matrix Hoeffding inequality, we need to use the following key insight about the structure of $E[\frac{1}{\alpha}\sum_t (I - \hat{P}_*\hat{P}_*')\ell_t e_t']$. This matrix is the dominant term in the perturbation seen by the $k$-th p-PCA step. Here $E[.]$ denotes expectation conditioned on accurate subspace recovery so far and $\sum_t$ denotes the sum over $t \in [\hat{t}_j + (k - 1)\alpha + 1, \hat{t}_j + k\alpha]$. The model on $\mathcal{T}_t$ and the fact that $e_t$ is supported on $\mathcal{T}_t$ can be used to show that this matrix can be written as the product of a full matrix and a block-banded matrix: for example when $\rho = 1$, the block-banded matrix will be block-diagonal, when $\rho = 2$, it will be block-tridiagonal, and so on. Also, $E[\frac{1}{\alpha}\sum_t e_t e_t']$ will be a block banded matrix. The 2-norm of a block banded matrix is bounded by the maximum norm of any block times the number of bands in it and hence is much smaller than that of a general full matrix.
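The bound on the 2-norm of a block-banded matrix can be illustrated numerically. The following sketch builds a random symmetric block-tridiagonal matrix (three bands) and compares its spectral norm with three times the largest block norm. The block size and the number of blocks are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
b, k = 4, 6                                    # block size, number of diagonal blocks
M = np.zeros((b * k, b * k))
block_norms = []
for i in range(k):
    G = rng.standard_normal((b, b))
    B = G @ G.T                                # symmetric PSD diagonal block
    M[i*b:(i+1)*b, i*b:(i+1)*b] = B
    block_norms.append(np.linalg.norm(B, 2))
    if i < k - 1:
        C = rng.standard_normal((b, b))        # off-diagonal block
        M[i*b:(i+1)*b, (i+1)*b:(i+2)*b] = C
        M[(i+1)*b:(i+2)*b, i*b:(i+1)*b] = C.T
        block_norms.append(np.linalg.norm(C, 2))

norm_M = np.linalg.norm(M, 2)
bound = 3 * max(block_norms)                   # 3 bands: diagonal, super-, sub-diagonal
print(norm_M, bound)
```

The bound follows from splitting the matrix into its three bands and applying the triangle inequality; each band, taken alone, has spectral norm equal to its largest block norm.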


The lemma that exploits the structure of a block-banded matrix generated due to the model on $\mathcal{T}_t$ is Lemma 5.3 given in Section V. This lemma is used to bound $E[\frac{1}{\alpha}\sum_t (I - \hat{P}_*\hat{P}_*')\ell_t e_t']$ and $E[\frac{1}{\alpha}\sum_t e_t e_t']$ in the proof of Lemma 6.23 in Section VII.


D. Proof Outline 


We will only prove Theorem 2.7. Theorem 2.5 follows as a corollary of Theorem 2.7 because of the following reasons. (1) Algorithm 1 does not compute $\hat{x}_t$ or its support $\hat{\mathcal{T}}_t$. For the matrix completion problem, $\mathcal{T}_t$ is given. Thus it does not use the parameters $\xi$ (which is the noise bound in the ell-1 minimization step) and $\omega$ (which is the support estimation threshold). (2) The bound on $x_{\min}$ and the values of the parameters $\xi$ and $\omega$ are only used in the proof of Lemma 6.15 to show exact support recovery, i.e. $\hat{\mathcal{T}}_t = \mathcal{T}_t$. Since for matrix completion $\mathcal{T}_t$ is given, Theorem 2.5 does not need the lower bound on $x_{\min}$.

The proof of Theorem 2.7 is given in Sections VI and VII. Before this, in the next section (Section V) we give the most general model on changes in the missing/outlier entries’ set $\mathcal{T}_t$, Model 5.1, and we show that Model 2.3 is a special case of this model. Next, we give a key lemma for sums of sparse matrices supported on rows and columns indexed by $\mathcal{T}_t$ satisfying this model (Lemma 5.3).

Section VI begins with defining various quantities needed for the proof. Next, we state the main lemmas used to prove the theorem, followed by the theorem’s proof. There is a main lemma associated with each of the three main tasks of the algorithm: 1) accurately recovering $x_t$ and hence $\ell_t$ at each time $t$ (Lemma 6.15), 2) detecting subspace change when and only when the subspace has changed, i.e. new directions have been added to the subspace (Lemmas 6.17 and 6.16), and 3) successfully estimating the dimension of the new subspace and updating its estimate by p-PCA (Lemma 6.18). To maintain the flow of the argument, we defer the proofs of these lemmas to the end of the section or to the appendix.

The proofs of Lemmas 6.21, 6.22 and 6.23 that are used together to prove Lemmas 6.17, 6.16 and 6.18 are rather long and are given in Section VII. The proof of Lemma 6.23 uses Lemma 5.3 from Section V.


V. Most General Model on Changes in Tt and a Key Lemma 
A. Most General Model on Changes in Tt 

Here we give our most general model on how $\mathcal{T}_t$ (the set of missing entries or the support set of $x_t$) can change. What we need to prevent is $\mathcal{T}_t$ occupying the same indices for too many time instants in a given interval. If $\mathcal{T}_t$ does not change ‘enough’ in a time interval of length $\alpha$, we will be unable to see enough entries of a given index of $\ell_t$ in order to be able to accurately fill in the missing ones. The following model quantifies ‘enough’ for our purposes. The number of time instants for which an index is part of $\mathcal{T}_t$ is determined both by how often this set changes, and by how much it moves when it changes. The latter is parameterized by $\rho$, which controls how much the set moves when it changes. For example $\rho = 1$ would require that distinct sets be disjoint, and $\rho = 2$ would mean that at least half of the set is displaced whenever it changes. The parameter $h^+ \in (0, 1)$ represents the maximum fraction of time for which the set remains in a given area in a time interval of length $\alpha$. The smaller $h^+$, the more frequently the set will need to change in order to satisfy the model. Our result requires a bound on the product $\rho^2 h^+$.

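Model 5.1. Let $\rho$ be a positive integer. Split $[1, t_{\max}]$ into intervals of length $\alpha$. Use $\mathcal{J}_u := [(u - 1)\alpha + 1, u\alpha]$ to denote the $u$-th interval. For a given interval, $\mathcal{J}_u$, let $\mathcal{T}_{(i),u}$ for $i = 1, \dots, l_u$ be mutually disjoint subsets of $\{1, \dots, n\}$ and let $\mathcal{J}_{(i),u}$, $i = 1, 2, \dots, l_u$, be a partition of the interval $\mathcal{J}_u$ so that

$$\text{for all } t \in \mathcal{J}_{(i),u},\quad \mathcal{T}_t \subseteq \mathcal{T}_{(i),u} \cup \mathcal{T}_{(i+1),u} \cup \dots \cup \mathcal{T}_{(i+\rho-1),u}. \qquad (10)$$

Define

$$h_u\big(\alpha; \{\mathcal{T}_{(i),u}, \mathcal{J}_{(i),u}\}_{i=1,\dots,l_u}\big) := \max_{i=1,\dots,l_u} |\mathcal{J}_{(i),u}| \qquad (11)$$

and define $h_u^*(\alpha)$, which takes the minimum over all choices of $\mathcal{T}_{(i),u}$ and over all choices of the partition $\mathcal{J}_{(i),u}$:

$$h_u^*(\alpha) := \min\, h_u\big(\alpha; \{\mathcal{T}_{(i),u}, \mathcal{J}_{(i),u}\}_{i=1,\dots,l_u}\big) \qquad (12)$$

where the minimum is over all choices of mutually disjoint $\mathcal{T}_{(i),u}$, $i = 1, 2, \dots, l_u$, and all choices of mutually disjoint $\mathcal{J}_{(i),u}$, $i = 1, 2, \dots, l_u$, so that $\cup_{i=1}^{l_u} \mathcal{J}_{(i),u} = \mathcal{J}_u$ and (10) holds.

Assume that $|\mathcal{T}_t| \le s$ and that for all $u$,

$$h_u^*(\alpha) \le h^+\alpha \quad \text{with} \quad h^+ \le \frac{0.01}{\rho^2}.$$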

i.e. the $\mathcal{J}_{(i),u}$’s are mutually disjoint intervals and their union equals $\mathcal{J}_u$.


In the above model, $h_u^*(\alpha)$ provides a bound on how long $\mathcal{T}_t$ remains in a given “area”, $\mathcal{T}_{(i),u} \cup \mathcal{T}_{(i+1),u} \cup \dots \cup \mathcal{T}_{(i+\rho-1),u}$, during the interval $\mathcal{J}_u$, for the best allocation of $\mathcal{T}_t$’s to a given “area” and the best choice of the “areas.” Notice that (10) can always be trivially satisfied by choosing $l_u = 1$, $\mathcal{T}_{(1),u} = \{1, \dots, n\}$ and $\mathcal{J}_{(1),u} = \mathcal{J}_u$, but this will give $h_u(\alpha; \cdot) = \alpha$ and hence is not a good choice. This is why we take a minimum over all choices.
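To make the definition concrete, the sketch below evaluates the quantity being minimized for one explicit choice of "areas" and partition, using a toy moving support: a block of $s$ contiguous indices that shifts by $s/\rho$ every $\beta$ frames. Everything here (the helper `h_for_choice`, the toy motion pattern, and the numbers) is an assumption made for illustration. The toy values are not meant to satisfy the bound on $\rho^2 h^+$; they only show how the containment condition (10) and the interval lengths interact.

```python
def h_for_choice(supports, areas, partition, rho):
    """Return max_i |J_(i)| if the choice satisfies the containment condition (10).

    supports : dict mapping time t -> set T_t
    areas    : list of mutually disjoint index sets T_(1), ..., T_(l)
    partition: list of lists of times J_(1), ..., J_(l) partitioning the interval
    """
    padded = list(areas) + [set()] * rho          # T_(i) taken as empty for i > l
    for i, J in enumerate(partition):
        allowed = set().union(*padded[i:i + rho])
        if any(not supports[t] <= allowed for t in J):
            raise ValueError(f"condition (10) fails in sub-interval {i + 1}")
    return max(len(J) for J in partition)


s, rho, beta, alpha = 6, 2, 5, 20
step = s // rho                                   # shift per change
stages = alpha // beta                            # number of distinct positions
supports = {k * beta + d: set(range(k * step, k * step + s))
            for k in range(stages) for d in range(1, beta + 1)}
n = step * (stages - 1) + s
areas = [set(range(i * step, (i + 1) * step)) for i in range(stages - 1)]
areas.append(set(range((stages - 1) * step, n)))  # last area absorbs the tail
partition = [list(range(k * beta + 1, (k + 1) * beta + 1)) for k in range(stages)]

h_good = h_for_choice(supports, areas, partition, rho)
h_trivial = h_for_choice(supports, [set(range(n))], [list(range(1, alpha + 1))], rho)
print(h_good, h_trivial)
```

In this toy case the structured choice gives a value equal to $\beta$, while the trivial single-area choice gives $\alpha$, which illustrates why the model takes a minimum over choices.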


Lemma 5.2. Model 2.3 is a special case of Model 5.1 above with $h^+ = \beta/\alpha$.

The proof is in the Appendix.


Some other special cases of the above model are discussed in Section IX 


B. A Key Lemma that uses Model 5.1 


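Lemma 5.3. Let $s_t = |\mathcal{T}_t|$. Consider a sequence of $s_t \times s_t$ symmetric positive-semidefinite matrices $A_t$ such that $\|A_t\|_2 \le \sigma^+$ for all $t$. Assume that the $\mathcal{T}_t$ obey Model 5.1. Let $M = \sum_{t \in \mathcal{J}_u} I_{\mathcal{T}_t} A_t I_{\mathcal{T}_t}'$ be an $n \times n$ matrix ($I$ is an $n \times n$ identity matrix). Then

$$\|M\|_2 \le \rho^2 h^+ \alpha \sigma^+ \le 0.01\,\alpha\,\sigma^+.$$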
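A quick numerical check of the first inequality in Lemma 5.3 is sketched below on a toy moving support (a block of $s$ contiguous indices shifting by $s/\rho$ every $\beta$ frames, so that each sub-interval of the structured partition has length $\beta$ and one may take $h^+\alpha = \beta$). The motion pattern, the random PSD matrices, and all numbers are assumptions for illustration only; in particular the toy values do not satisfy $\rho^2 h^+ \le 0.01$, so only the bound $\rho^2 h^+ \alpha \sigma^+$ is checked.

```python
import numpy as np

rng = np.random.default_rng(5)
s, rho, beta, alpha, sigma_plus = 6, 2, 5, 40, 1.0
step, stages = s // rho, alpha // beta
n = step * (stages - 1) + s

M = np.zeros((n, n))
for k in range(stages):
    T = np.arange(k * step, k * step + s)        # support during stage k
    for _ in range(beta):
        G = rng.standard_normal((s, s))
        A = G @ G.T
        A *= sigma_plus / np.linalg.norm(A, 2)   # PSD with ||A_t||_2 = sigma_plus
        M[np.ix_(T, T)] += A                     # adds I_T A_t I_T'

norm_M = np.linalg.norm(M, 2)
lemma_bound = rho**2 * beta * sigma_plus         # rho^2 * (h^+ alpha) * sigma^+
trivial_bound = alpha * sigma_plus               # bound with no structure on T_t
print(norm_M, lemma_bound, trivial_bound)
```

The comparison with `trivial_bound` shows what the structure buys: without any model on the supports, the sum of $\alpha$ matrices could have norm as large as $\alpha\sigma^+$.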


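Proof: We will first prove the lemma for the special case when $\rho = 2$. After this, we will show how to generalize the proof when $\rho > 2$. For a given $u$, let $\mathcal{T}_{(i),u}$, $i = 1, 2, \dots, l_u$, and correspondingly $\mathcal{J}_{(i),u}$ denote the best choices, i.e. the choices that attain the minimum values in the definition of $h_u^*(\alpha)$.

In the rest of the proof, we remove the subscript $u$ from $l_u$ and from the $\mathcal{T}_{(i),u}$’s for ease of notation. For simplicity of notation, we will let $\mathcal{T}_{(l+1)} = \emptyset$.

For times $t \in \mathcal{J}_{(i),u}$, define $A_{t,\text{full}}$ to be $A_t$ with rows and columns of zeros appropriately inserted so that

$$I_{\mathcal{T}_t} A_t I_{\mathcal{T}_t}' = [I_{\mathcal{T}_{(i)}}\ I_{\mathcal{T}_{(i+1)}}]\, A_{t,\text{full}}\, [I_{\mathcal{T}_{(i)}}\ I_{\mathcal{T}_{(i+1)}}]'. \qquad (13)$$

Such an $A_{t,\text{full}}$ exists because $\mathcal{T}_t \subseteq \mathcal{T}_{(i)} \cup \mathcal{T}_{(i+1)}$ for any $t \in \mathcal{J}_{(i),u}$. Notice that

$$\|A_{t,\text{full}}\|_2 = \|A_t\|_2, \qquad (14)$$

because $A_{t,\text{full}}$ is permutation similar to

$$\begin{bmatrix} A_t & 0 \\ 0 & 0 \end{bmatrix}.$$

Since $\mathcal{T}_{(i)}$ and $\mathcal{T}_{(i+1)}$ are disjoint, we can, after permutation similarity, correspondingly partition $A_{t,\text{full}}$ as

$$A_{t,\text{full}} = \begin{bmatrix} A^{(0,0)}_{t,\text{full}} & A^{(0,1)}_{t,\text{full}} \\ A^{(1,0)}_{t,\text{full}} & A^{(1,1)}_{t,\text{full}} \end{bmatrix}$$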
( 1 , 1 ) 
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for all t G Notice that because At is symmetric, Then, 

M=Y, ItAiIt: 


tej^ 
l 

XI X by ([13]) 

i=l 

I r ^(0,0) wo,i) -\ Y , 

^T(i) 

It' 


X X In 

i=l 


+1)J 


4(0.0) .(0,1) 


4(1.0) 4(1.1) 


^ ^ [j-r ' + It It ' + It It ' + It It ' 

i=l 


iT,„\ E <“,1)^^./+E 

;-i 




(o 


+E 

2=1 


E 4a + E 4a 1^7^.,' 


^7^.1 E 4aaw + ^7;„j E 4aUr„/ 


+ it„, I E 4 ai I ^T„: 


I isaii),' 






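Because $\mathcal{T}_{(i)}$ and $\mathcal{T}_{(k)}$ are disjoint for $i \ne k$, $M$ has a block tridiagonal structure (by a permutation similarity if necessary):

$$M = \begin{bmatrix}
B_{(1)} & C_{(1)} & 0 & \cdots & 0 \\
C_{(1)}' & B_{(2)} & C_{(2)} & & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & & C_{(l-2)}' & B_{(l-1)} & C_{(l-1)} \\
0 & \cdots & 0 & C_{(l-1)}' & B_{(l)}
\end{bmatrix} \qquad (15)$$

where $B_{(1)} = \sum_{t \in \mathcal{J}_{(1),u}} A_{t,\text{full}}^{(0,0)}$,

$$B_{(i)} = \sum_{t \in \mathcal{J}_{(i),u}} A_{t,\text{full}}^{(0,0)} + \sum_{t \in \mathcal{J}_{(i-1),u}} A_{t,\text{full}}^{(1,1)} \quad \text{for } i = 2, 3, \dots, l \qquad (16)$$

and

$$C_{(i)} = \sum_{t \in \mathcal{J}_{(i),u}} A_{t,\text{full}}^{(0,1)} \quad \text{for } i = 1, 2, \dots, l-1. \qquad (17)$$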


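Now we proceed to bound $\|M\|_2$.

$$\|M\|_2 \le \left\| \begin{bmatrix} B_{(1)} & & \\ & \ddots & \\ & & B_{(l)} \end{bmatrix} \right\|_2 + \left\| \begin{bmatrix} 0 & C_{(1)} & & \\ & 0 & \ddots & \\ & & \ddots & C_{(l-1)} \\ & & & 0 \end{bmatrix} \right\|_2 + \left\| \begin{bmatrix} 0 & & & \\ C_{(1)}' & 0 & & \\ & \ddots & \ddots & \\ & & C_{(l-1)}' & 0 \end{bmatrix} \right\|_2$$

Call the middle matrix $C$, and observe that $CC'$ is block diagonal with blocks $C_{(i)}C_{(i)}'$. So $\|C\|_2 = \max_i \|C_{(i)}\|_2$. Therefore,

$$\begin{aligned}
\|M\|_2 &\le \max_i \|B_{(i)}\|_2 + 2 \max_i \|C_{(i)}\|_2 \\
&\le \max_i \Big\| \sum_{t \in \mathcal{J}_{(i),u}} A_{t,\text{full}}^{(0,0)} + \sum_{t \in \mathcal{J}_{(i-1),u}} A_{t,\text{full}}^{(1,1)} \Big\|_2 + 2 \max_i \Big\| \sum_{t \in \mathcal{J}_{(i),u}} A_{t,\text{full}}^{(0,1)} \Big\|_2 && \text{by (16) and (17)} \\
&\le \max_i \Big( \sum_{t \in \mathcal{J}_{(i),u}} \|A_t\|_2 + \sum_{t \in \mathcal{J}_{(i-1),u}} \|A_t\|_2 \Big) + 2 \max_i \sum_{t \in \mathcal{J}_{(i),u}} \|A_t\|_2 && \text{by (14)} \\
&\le \big( \sigma^+ h_u^*(\alpha) + \sigma^+ h_u^*(\alpha) \big) + 2 \sigma^+ h_u^*(\alpha) \le 4 \sigma^+ h^+ \alpha.
\end{aligned}$$

The third row used the fact that $\|A_{t,\text{full}}^{(k,l)}\|_2 \le \|A_{t,\text{full}}\|_2 = \|A_t\|_2$ for any sub-matrix $A_{t,\text{full}}^{(k,l)}$ of $A_{t,\text{full}}$.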

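This finishes the proof for the $\rho = 2$ case. For this case, notice that there are 3 bands in (15): the diagonal band and one band on each side of the diagonal one. When $\rho = 3$, everything will follow analogously to the above; instead of 3 bands, there will be 5 bands in the definition of $M$ and we will be able to bound its norm by

$$\begin{aligned}
\|M\|_2 &\le \max_i \Big( \sum_{t \in \mathcal{J}_{(i),u}} \|A_t\|_2 + \sum_{t \in \mathcal{J}_{(i-1),u}} \|A_t\|_2 + \sum_{t \in \mathcal{J}_{(i-2),u}} \|A_t\|_2 \Big) \\
&\quad + 2 \max_i \Big( \sum_{t \in \mathcal{J}_{(i),u}} \|A_t\|_2 + \sum_{t \in \mathcal{J}_{(i-1),u}} \|A_t\|_2 \Big) + 2 \max_i \sum_{t \in \mathcal{J}_{(i),u}} \|A_t\|_2 \\
&\le 3 \sigma^+ h_u^*(\alpha) + 2 \big( 2 \sigma^+ h_u^*(\alpha) + \sigma^+ h_u^*(\alpha) \big) \le 9 \sigma^+ h^+ \alpha.
\end{aligned}$$

Proceeding this way, for a general $\rho$, there will be $1 + 2(\rho - 1) = 2\rho - 1$ bands. Any term in the central band will contain a summation of $\sum_{t \in \mathcal{J}_{(i),u}} \|A_t\|_2$ over $\rho$ sub-intervals; any term in the first band away from the diagonal will contain this summation over $(\rho - 1)$ sub-intervals; any term in the second band away from the diagonal will contain this summation over $(\rho - 2)$ sub-intervals; and so on. Thus, we will be summing the quantity $\sigma^+ h^+ \alpha$ a total of $\big( \rho + 2 \sum_{i=1}^{\rho-1} i \big) = \rho^2$ times and so we will get $\|M\|_2 \le \rho^2 \sigma^+ h^+ \alpha$. ■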
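The counting step at the end of the proof is simple arithmetic and can be checked directly. The helper below is a small illustrative sketch (the function name is hypothetical): it adds the $\rho$ terms of the central band to the terms of the two bands at each distance $i = 1, \dots, \rho - 1$ from the diagonal, which contribute $\rho - i$ terms each.

```python
def band_term_count(rho):
    """Number of sigma^+ h^+ alpha terms summed in the proof for a given rho.

    The central band contributes rho terms; the two bands at distance i
    from the diagonal contribute rho - i terms each.
    """
    return rho + 2 * sum(rho - i for i in range(1, rho))


for rho in range(1, 8):
    print(rho, band_term_count(rho))
```

For $\rho = 2$ and $\rho = 3$ this reproduces the constants 4 and 9 that appear in the two worked cases of the proof.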


VI. Proof of Theorem 2.7 and Theorem 2.5


As explained in Section IV-D, we will only prove Theorem 2.7. Theorem 2.5 follows as an easy corollary.


A. Definitions 

Definition 6.1. Define $e_t$ to be the error made in estimating $x_t$ and $\ell_t$. That is

$$e_t := \hat{x}_t - x_t = \ell_t - \hat{\ell}_t.$$

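Definition 6.2. Define the interval

$$\mathcal{J}_u := [(u-1)\alpha + 1, u\alpha].$$

Also define $u_j$ to be the $u$ such that $t_j \in \mathcal{J}_u$. That is

$$u_j := \left\lceil \frac{t_j}{\alpha} \right\rceil.$$

For the purposes of describing events before the first subspace change, let $u_0 := 0$. Also define

$$\hat{u}_j := \frac{\hat{t}_j}{\alpha}.$$

Notice from the algorithm that this will be an integer, because detection only occurs when $t \bmod \alpha = 0$.

[Figure: timeline from $t_j$ to $t_j + d$ with marks at $\hat{t}_j$ (at most $2\alpha$ after $t_j$), $\hat{t}_j + k\alpha$ and $\hat{t}_j + K\alpha$. The estimate is $\hat{P}_t = \hat{P}_{(j),*}$ at first, then $[\hat{P}_{(j),*}\ \hat{P}_{(j),\text{new},1}]$, ..., $[\hat{P}_{(j),*}\ \hat{P}_{(j),\text{new},k}]$, and finally $[\hat{P}_{(j),*}\ \hat{P}_{(j),\text{new},K}] = \hat{P}_{(j+1),*}$; over the interval, $\|a_{t,\text{new}}\|_\infty \le \gamma_{\text{new}}$.]

Fig. 3: A diagram to visualize Algorithm and Definition 6.4. The $k$-th projection-PCA step (at $t = \hat{t}_j + k\alpha$) computes the top left singular vectors of $(I - \hat{P}_{(j),*}\hat{P}_{(j),*}')[\hat{\ell}_{\hat{t}_j + (k-1)\alpha + 1}, \hat{\ell}_{\hat{t}_j + (k-1)\alpha + 2}, \dots, \hat{\ell}_{\hat{t}_j + k\alpha}]$.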


We will show that, under appropriate conditioning, w.h.p., $\hat{u}_j = u_j$ or $\hat{u}_j = u_j + 1$.

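Definition 6.3. Define

• $P_{(j)} := P_{t_j}$ for $j = 0, 1, \dots, J$

• $P_{(j),*} := P_{(j-1)}$ and $P_{(j),\text{new}} := P_{t_j,\text{new}}$ for $j = 1, \dots, J$

• $a_{t,*} := P_{(j),*}'\ell_t$ and $a_{t,\text{new}} := P_{(j),\text{new}}'\ell_t$ for $t \in [t_j, t_{j+1} - 1]$

Thus, for $t \in [t_j, t_{j+1} - 1]$, $\ell_t$ can be written as

$$\ell_t = \begin{bmatrix} P_{(j),*} & P_{(j),\text{new}} \end{bmatrix} \begin{bmatrix} a_{t,*} \\ a_{t,\text{new}} \end{bmatrix} = P_{(j),*} a_{t,*} + P_{(j),\text{new}} a_{t,\text{new}}$$

and $\mathrm{Cov}(\ell_t) = \Sigma_t$ can be rewritten as

$$\Sigma_t = \begin{bmatrix} P_{(j),*} & P_{(j),\text{new}} \end{bmatrix} \begin{bmatrix} \Lambda_{t,*} & 0 \\ 0 & \Lambda_{t,\text{new}} \end{bmatrix} \begin{bmatrix} P_{(j),*}' \\ P_{(j),\text{new}}' \end{bmatrix}.$$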


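Definition 6.4. For $j = 1, 2, \dots, J$ and $k = 1, 2, \dots, K$ define

1) $\hat{P}_{(1),*} := \hat{P}_{t_{\text{train}}}$ (initial estimate) and $\hat{P}_{(j),*} := \hat{P}_{\hat{t}_{j-1} + K\alpha}$. If all subspace changes are correctly detected, this is the final estimate of $P_{(j),*} = P_{(j-1)}$.

2) $\hat{P}_{(j),\text{new},0} := [.]$ and $\hat{P}_{(j),\text{new},k} := \hat{P}_{\hat{t}_j + k\alpha,\text{new}}$. This is the $k$-th estimate of $P_{(j),\text{new}}$ (again, conditioned on correct change time detection).


Notice from the algorithm that

1) $\hat{P}_{t,*} = \hat{P}_{(j),*}$ for all $t \in [\hat{t}_{j-1} + K\alpha, \hat{t}_j + K\alpha - 1]$.

2) $\hat{P}_{t,\text{new}} = \hat{P}_{(j),\text{new},k}$ for all $t \in [\hat{t}_j + k\alpha, \hat{t}_j + (k+1)\alpha - 1]$.

3) At all times $\hat{P}_t = [\hat{P}_{t,*}\ \hat{P}_{t,\text{new}}]$. Thus $\hat{P}_t$ and $\hat{P}_{t,\text{new}}$ update at every $t = \hat{t}_j + k\alpha$, $k = 1, 2, \dots, K$, $j = 1, 2, \dots, J$, while $\hat{P}_{t,*}$ updates at every $t = \hat{t}_{j-1} + K\alpha$, $j = 2, \dots, J$.

4) $\hat{P}_{t-1,*} \perp \hat{P}_{t,\text{new}}$ at $t = \hat{t}_j + k\alpha$ and so $\hat{P}_{(j),*} \perp \hat{P}_{(j),\text{new},k}$.

5) $\Phi_t = (I - \hat{P}_{(j),*}\hat{P}_{(j),*}' - \hat{P}_{(j),\text{new},k}\hat{P}_{(j),\text{new},k}')$ when $t \in \mathcal{J}_{\hat{u}_j + (k+1)}$, $k = 1, 2, \dots, K - 1$.

6) $\Phi_t = (I - \hat{P}_{(j),*}\hat{P}_{(j),*}')$ when $t \in [t_j, \hat{t}_j + \alpha]$ (recall that $\hat{t}_j = \hat{u}_j\alpha$).

7) $\Phi_t = (I - \hat{P}_{(j+1),*}\hat{P}_{(j+1),*}')$ when $t \in [\hat{t}_j + K\alpha + 1, t_{j+1} - 1]$.

Using the notation from the above definition, Figure 3 summarizes the algorithm.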

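Definition 6.5. Recall that for basis matrices $P$ and $Q$, $\mathrm{dif}(P, Q) := \|(I - PP')Q\|_2$. Define

1) $\zeta_{j,*} := \mathrm{dif}(\hat{P}_{(j),*}, P_{(j),*})$

2) $\zeta_{j,\text{new},k} := \mathrm{dif}([\hat{P}_{(j),*}\ \hat{P}_{(j),\text{new},k}], P_{(j),\text{new}})$

Recall $\mathrm{SE}_t = \mathrm{dif}(\hat{P}_t, P_t)$. Notice that if subspace change times are correctly detected, for $t \in \mathcal{J}_{\hat{u}_j + k}$, $\mathrm{SE}_t \le \zeta_{j,*} + \zeta_{j,\text{new},k-1}$ for $k = 1, 2, \dots, K$; for $t \in [t_j, \hat{t}_j + \alpha]$, $\mathrm{SE}_t \le 1$; and for $t \in [\hat{t}_j + K\alpha + 1, t_{j+1} - 1]$, $\mathrm{SE}_t = \zeta_{j+1,*}$.

Definition 6.6. Define

1) $\zeta_{j,*}^+ := (r_0 + (j-1) r_{\text{new}})\zeta$

2) $\zeta_{j,\text{new},0}^+ := 1$ and

$$\zeta_{j,\text{new},k}^+ := \frac{b_{\mathcal{H},k}}{b_A - b_{A,\perp} - b_{\mathcal{H},k}}$$

for $k = 1, \dots, K$, where $b_A$, $b_{A,\perp}$, $b_{\mathcal{H},k}$ are defined in Lemmas 6.21, 6.22 and 6.23. Their expressions use $\epsilon$ given by (20).

We will show that these are high probability upper bounds on $\zeta_{j,*}$ and $\zeta_{j,\text{new},k}$ under appropriate conditioning.

As we will see later, $b_A \approx \lambda_{\text{new}}^-$, $b_{A,\perp} \approx (\zeta_{j,*}^+)^2\lambda^+$, and $b_{\mathcal{H},k} \approx 2\sqrt{\rho^2 h^+}\,\phi^+\big((\zeta_{j,*}^+)^2\lambda^+ + \zeta_{j,\text{new},k-1}^+\lambda_{\text{new}}^+\big)$. Here $\approx$ means we are giving only the most dominant term for each expression. Thus,

$$\zeta_{j,\text{new},k}^+ \approx \frac{2\sqrt{\rho^2 h^+}\,\phi^+\big(\zeta_{j,\text{new},k-1}^+\lambda_{\text{new}}^+ + (\zeta_{j,*}^+)^2\lambda^+\big)}{\lambda_{\text{new}}^- - (\zeta_{j,*}^+)^2\lambda^+ - 2\sqrt{\rho^2 h^+}\,\phi^+\big(\zeta_{j,\text{new},k-1}^+\lambda_{\text{new}}^+ + (\zeta_{j,*}^+)^2\lambda^+\big)}.$$

By using the slow subspace change inequality, the bounds on $\zeta$ from the theorem, and the bound on $\rho^2 h^+$, one can show that this decays roughly exponentially with $k$ (see Lemma 6.14).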
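To get a feel for the decay, the dominant-term recursion for $\zeta_{j,\text{new},k}^+$ can simply be iterated. The sketch below is illustrative only: the function name and all numerical values ($\lambda_{\text{new}}^- = 1$, $\lambda_{\text{new}}^+ = 1.5$, $(\zeta_{j,*}^+)^2\lambda^+ = 10^{-4}$) are assumptions chosen for the example, not values taken from the theorem; only $\rho^2 h^+ = 0.01$ and $\phi^+ = 1.2$ come from the text.

```python
import math


def zeta_new_plus(K, lam_new_minus=1.0, lam_new_plus=1.5, zs2_lam=1e-4,
                  rho2h=0.01, phi=1.2):
    """Iterate the dominant-term recursion, starting from zeta^+_{new,0} = 1.

    zs2_lam stands for (zeta^+_{j,*})^2 * lambda^+; the eigenvalue-related
    defaults are illustrative assumptions.
    """
    c = 2.0 * math.sqrt(rho2h) * phi
    zetas = [1.0]
    for _ in range(K):
        num = c * (zetas[-1] * lam_new_plus + zs2_lam)
        zetas.append(num / (lam_new_minus - zs2_lam - num))
    return zetas


seq = zeta_new_plus(10)
for k, z in enumerate(seq):
    print(k, z)
```

With these assumed values the sequence shrinks by a roughly constant factor at each step until it flattens near a small floor set by the $(\zeta_{j,*}^+)^2\lambda^+$ term, which is the qualitative behaviour that Lemma 6.14 makes precise.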

Definition 6.7. Define the random variable

$$X_u := \{a_1, a_2, \dots, a_{u\alpha}\}.$$


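Definition 6.8. Recall the definition of $\mathcal{D}_u$ from the algorithm. For $j = 1, \dots, J$, $k = 1, \dots, K$, and for $a = u_j$ or $a = u_j + 1$, define the following events

• $\mathrm{DET}_j^a := \{\hat{u}_j = a\}$

• $\mathrm{PPCA}_{j,k}^a := \{\hat{u}_j = a \text{ and } \mathrm{rank}(\hat{P}_{(j),\text{new},k}) = r_{j,\text{new}} \text{ and } \zeta_{j,\text{new},k} \le \zeta_{j,\text{new},k}^+\}$

• $\mathrm{NODETS}_j^a := \{\hat{u}_j = a \text{ and } \lambda_{\max}(\tfrac{1}{\alpha}\mathcal{D}_u\mathcal{D}_u') < \text{thresh for all } u \in [\hat{u}_j + K + 1, u_{j+1} - 1]\}$

• $\Gamma_{0,\text{end}} := \{\zeta_{1,*} \le r_0\zeta\} \cap \{\lambda_{\max}(\tfrac{1}{\alpha}\mathcal{D}_u\mathcal{D}_u') < \text{thresh for all } u \in [1, u_1 - 1]\}$

• $\Gamma_{j,0}^a := \Gamma_{j-1,\text{end}} \cap \mathrm{DET}_j^a$

• $\Gamma_{j,k}^a := \Gamma_{j,k-1}^a \cap \mathrm{PPCA}_{j,k}^a$

• $\Gamma_{j,\text{end}} := \big(\Gamma_{j,K}^{u_j} \cap \mathrm{NODETS}_j^{u_j}\big) \cup \big(\Gamma_{j,K}^{u_j+1} \cap \mathrm{NODETS}_j^{u_j+1}\big)$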

We misuse notation as follows. Suppose that a set E is a subset of all possible values that a r.v. X can take. 
For two r.v.s’ {X,Y}, when we need to say “X G P and Y can be anything” we will sometimes misuse notation 
and just say “{X,Y} G P.” For example, we sometimes say X^^ G Pj,end- This means X^^-i G Pj,end at for 
t G jfuj are unconstrained. 


Definition 6.9. Define 

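1) Let $D_{j,\text{new}} := (I - \hat{P}_{(j),*}\hat{P}_{(j),*}')P_{(j),\text{new}}$ and let $D_{j,\text{new}} = E_{j,\text{new}}R_{j,\text{new}}$ denote its reduced QR decomposition, i.e. let $E_{j,\text{new}}$ be a basis matrix for $\mathrm{range}(D_{j,\text{new}})$ and let $R_{j,\text{new}} = E_{j,\text{new}}'D_{j,\text{new}}$.

2) Let $E_{j,\text{new},\perp}$ be a basis matrix for the orthogonal complement of $\mathrm{range}(E_{j,\text{new}})$. To be precise, $E_{j,\text{new},\perp}$ is an $n \times (n - r_{j,\text{new}})$ basis matrix so that $[E_{j,\text{new}}\ E_{j,\text{new},\perp}]$ is unitary.

3) For $u = u_j + 1$ and $u = \hat{u}_j + k$ for $k = 1, \dots, K$, define $A_u$, $A_{u,\perp}$ as

$$A_u := \frac{1}{\alpha}\sum_{t \in \mathcal{J}_u} E_{j,\text{new}}'(I - \hat{P}_{(j),*}\hat{P}_{(j),*}')\,\ell_t\ell_t'\,(I - \hat{P}_{(j),*}\hat{P}_{(j),*}')E_{j,\text{new}}$$

$$A_{u,\perp} := \frac{1}{\alpha}\sum_{t \in \mathcal{J}_u} E_{j,\text{new},\perp}'(I - \hat{P}_{(j),*}\hat{P}_{(j),*}')\,\ell_t\ell_t'\,(I - \hat{P}_{(j),*}\hat{P}_{(j),*}')E_{j,\text{new},\perp}$$

and let

$$\mathcal{A}_u := \begin{bmatrix} E_{j,\text{new}} & E_{j,\text{new},\perp} \end{bmatrix} \begin{bmatrix} A_u & 0 \\ 0 & A_{u,\perp} \end{bmatrix} \begin{bmatrix} E_{j,\text{new}}' \\ E_{j,\text{new},\perp}' \end{bmatrix}.$$

4) For $u = u_j + 1$ and $u = \hat{u}_j + k$ for $k = 1, \dots, K$, define $\mathcal{M}_u$ and $\mathcal{H}_u$ as

$$\mathcal{M}_u := \frac{1}{\alpha}\sum_{t \in \mathcal{J}_u} (I - \hat{P}_{(j),*}\hat{P}_{(j),*}')\,\hat{\ell}_t\hat{\ell}_t'\,(I - \hat{P}_{(j),*}\hat{P}_{(j),*}')$$

and

$$\mathcal{H}_u := \mathcal{M}_u - \mathcal{A}_u.$$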

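Remark 6.10. Recall the definition of $\mathcal{D}_u$ from the algorithm.

Conditioned on $\Gamma_{j-1,\text{end}}$, for $u = u_j + 1$, $\hat{P}_{u\alpha - 1,*} = \hat{P}_{(j),*}$ (in other words all $j - 1$ previous subspace changes were detected) and thus, for this value of $u$,

$$\frac{1}{\alpha}\mathcal{D}_u\mathcal{D}_u' = \mathcal{M}_u.$$

In this case, $\mathcal{M}_u$ is the matrix whose maximum eigenvalue is checked to detect subspace change.

Conditioned on $\Gamma_{j,0}^{\hat{u}_j}$, for $u = \hat{u}_j + k$, $k = 1, 2, \dots, K$, $\hat{P}_{u\alpha - 1,*} = \hat{P}_{(j),*}$ and thus, for these values of $u$ also,

$$\frac{1}{\alpha}\mathcal{D}_u\mathcal{D}_u' = \mathcal{M}_u.$$

In this case, $\mathcal{M}_u$ is the matrix whose eigenvectors with eigenvalue above thresh form $\hat{P}_{(j),\text{new},k}$ (see the corresponding step of the algorithm). In other words, $\mathcal{M}_u$ has eigendecomposition

$$\mathcal{M}_u \overset{\text{EVD}}{=} \begin{bmatrix} \hat{P}_{(j),\text{new},k} & \hat{P}_{(j),\text{new},k,\perp} \end{bmatrix} \begin{bmatrix} \Lambda_k & 0 \\ 0 & \Lambda_{k,\perp} \end{bmatrix} \begin{bmatrix} \hat{P}_{(j),\text{new},k}' \\ \hat{P}_{(j),\text{new},k,\perp}' \end{bmatrix}.$$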


Definition 6.11. Define

1) $\kappa_{s,*} := \kappa_s(P_{(j)})$ and $\kappa_{s,\text{new}} := \max_j \kappa_s(P_{(j),\text{new}})$.

2) $\kappa_s^+ := 0.0215$. As we will show later in Lemma 7.8, this upper bounds $\|I_{\mathcal{T}}'D_{j,\text{new}}\|_2$ under appropriate conditioning.

3) $\phi^+ := 1.2$. As we will show later in Lemma 6.15, this upper bounds $\phi_t := \|[(\Phi_t)_{\mathcal{T}_t}'(\Phi_t)_{\mathcal{T}_t}]^{-1}\|_2$ under appropriate conditioning.


Remark 6.12. The entire proof uses Model 5.1 on $\mathcal{T}_t$. By Lemma 5.2, Model 2.3 is a special case of it. In particular, this means that (a) Model 2.3 also implies $\rho^2 h^+ \le 0.01$ and (b) Model 2.3 also allows us to use Lemma 5.3. This lemma is used in the proof of Lemma 6.23 in Section VII.


B. Five Main Lemmas for Proving Theorem 2.7 


Fact 6.13. Observe that $\Gamma_{j,0}^a$, both for $a = u_j$ and $a = u_j + 1$, implies that $u_j \le \hat{u}_j \le u_j + 1$. Thus, in both cases, $t_j \le \hat{t}_j \le t_j + 2\alpha$. So with the model assumption that $d \ge (K + 2)\alpha$, we have that $\mathcal{J}_{\hat{u}_j + k} \subseteq [t_j, t_j + d]$ for $k = 1, 2, \dots, K$. This fact is needed so that we can use the "slow subspace change" inequality to bound the eigenvalues along the new directions, and so that we can bound $\|a_{t,\text{new}}\|_\infty$ by $\gamma_{\text{new}}$.
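The interval arithmetic behind Fact 6.13 is easy to verify for concrete numbers. The sketch below is an illustration with a hypothetical helper name; it uses $t_1 = 600$, $\alpha = 800$ and $K = 6$ from Section VIII and takes $d = (K + 2)\alpha$, the smallest value allowed by the model assumption. It computes the detection time $\hat{t}_j = \hat{u}_j\alpha$ and the right end of the last projection-PCA interval for both possible detection indices.

```python
def projection_pca_window(t_j, alpha, K, detected_late):
    """Return (t_hat_j, right end of J_{u_hat_j + K}) for detection at u_j or u_j + 1."""
    u_j = -(-t_j // alpha)                 # ceil(t_j / alpha) using integers only
    u_hat = u_j + (1 if detected_late else 0)
    t_hat = u_hat * alpha                  # detection happens at a multiple of alpha
    last = (u_hat + K) * alpha             # right end of the interval J_{u_hat + K}
    return t_hat, last


t_j, alpha, K = 600, 800, 6
d = (K + 2) * alpha
for late in (False, True):
    t_hat, last = projection_pca_window(t_j, alpha, K, late)
    # t_j <= t_hat <= t_j + 2*alpha, and all K projection-PCA intervals end by t_j + d
    print(late, t_hat, last, t_j <= t_hat <= t_j + 2 * alpha, last <= t_j + d)
```

Both detection cases keep every projection-PCA interval inside $[t_j, t_j + d]$, which is the containment the fact relies on.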


Lemma 6.14 (Exponential decay of the bound on $\zeta_{j,\text{new},k}$ (similar to [12, Lemma 6.1])). Under the conditions of Theorem 2.7,

$$\zeta_{j,\text{new},k}^+ \le 0.83^k + 0.84\, r_{\text{new}}\zeta.$$

This lemma follows by applying simple algebra on the definition and using the bounds assumed on $\zeta$ and $\rho^2 h^+$ in Theorem 2.7. A detailed proof of this lemma is given in Appendix

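Lemma 6.15 (Sparse Recovery Lemma (similar to [12, Lemma 6.4])). Assume that all of the conditions of Theorem 2.7 hold. Recall that $\mathrm{SE}_t = \mathrm{dif}(\hat{P}_t, P_t)$.

1) Conditioned on $\Gamma_{j-1,\text{end}}$, for $t \in [t_j, (\hat{u}_j + 1)\alpha]$

a) $\phi_t := \|[(\Phi_t)_{\mathcal{T}_t}'(\Phi_t)_{\mathcal{T}_t}]^{-1}\|_2 \le \phi^+ := 1.2$.

b) the support of $x_t$ is recovered exactly, i.e. $\hat{\mathcal{T}}_t = \mathcal{T}_t$, and $e_t$ satisfies:

$$e_t := \hat{x}_t - x_t = \ell_t - \hat{\ell}_t = I_{\mathcal{T}_t}[(\Phi_t)_{\mathcal{T}_t}'(\Phi_t)_{\mathcal{T}_t}]^{-1}I_{\mathcal{T}_t}'\Phi_t\ell_t \qquad (18)$$

c) Furthermore,

$$\mathrm{SE}_t \le 1, \text{ and}$$

$$\|e_t\|_2 \le \phi^+\big(\zeta_{j,*}^+\sqrt{r}\gamma + \sqrt{r_{\text{new}}}\gamma_{\text{new}}\big) \le 1.2\big(\sqrt{\zeta} + \sqrt{r_{\text{new}}}\gamma_{\text{new}}\big).$$

2) For $k = 2, 3, \dots, K$ and $\hat{u}_j = u_j$ or $\hat{u}_j = u_j + 1$, conditioned on $\Gamma_{j,k-1}^{\hat{u}_j}$, for $t \in \mathcal{J}_{\hat{u}_j + k} = [(\hat{u}_j + k - 1)\alpha + 1, (\hat{u}_j + k)\alpha]$, the first two conclusions above hold. That is, $\phi_t \le \phi^+$ and $e_t$ satisfies (18). Furthermore,

$$\mathrm{SE}_t \le \zeta_{j,*}^+ + \zeta_{j,\text{new},k-1}^+, \text{ and}$$

$$\|e_t\|_2 \le \phi^+\big(\zeta_{j,*}^+\sqrt{r}\gamma + \zeta_{j,\text{new},k-1}^+\sqrt{r_{\text{new}}}\gamma_{\text{new}}\big) \le 1.2\big(1.84\sqrt{\zeta} + (0.83)^{k-1}\sqrt{r_{\text{new}}}\gamma_{\text{new}}\big).$$

3) For $\hat{u}_j = u_j$ or $\hat{u}_j = u_j + 1$, conditioned on $\Gamma_{j,K}^{\hat{u}_j}$, for $t \in [(\hat{u}_j + K)\alpha + 1, t_{j+1} - 1]$, the first two conclusions above hold ($\phi_t \le \phi^+$ and $e_t$ satisfies (18)). Furthermore,

$$\mathrm{SE}_t \le \zeta_{j+1,*}^+, \text{ and}$$

$$\|e_t\|_2 \le \phi^+\zeta_{j+1,*}^+\sqrt{r}\gamma \le 1.2\sqrt{\zeta}.$$

Notice that cases 1) and 3) of the above lemma occur when the algorithm is in the detection phase, while during the intervals for case 2) the algorithm is performing projection-PCA. In case 1) new directions have been added but not estimated, so the error is larger. In case 2), the error is decaying exponentially with each estimation step. Finally, case 3) occurs after the new directions have been successfully estimated and contains the tightest error bounds.

The proof is given in Appendix C.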


Lemma 6.16 (No false detection of subspace changes).

1) The event $\Gamma_{j,K}^a$, and so also the event $\Gamma_{j,\text{end}}$, implies that $\zeta_{j+1,*} \le \zeta_{j+1,*}^+$.

2) $P(\mathrm{NODETS}_j^a \mid \Gamma_{j,K}^a) = 1$ for $a = u_j$ or $a = u_j + 1$.

Lemma 6.17 (Subspace change detected within $2\alpha$ frames). For $j = 1, \dots, J$,

$$P\big(\mathrm{DET}_j^{u_j+1} \mid \Gamma_{j-1,\text{end}}, \overline{\mathrm{DET}_j^{u_j}}\big) \ge p_{\text{det},1} := 1 - p_A - p_{\mathcal{H}}.$$

The definitions of $p_A$ and $p_{\mathcal{H}}$ can be found in Lemmas 6.21 and 6.23, respectively.

Lemma 6.18 ($k$-th iteration of pPCA works well).

$$P\big(\Gamma_{j,k}^a \mid \Gamma_{j,k-1}^a\big) = P\big(\mathrm{PPCA}_{j,k}^a \mid \Gamma_{j,k-1}^a\big) \ge p_{\text{ppca}} := 1 - p_A - p_{A,\perp} - p_{\mathcal{H}}$$

for $a = u_j$ or $a = u_j + 1$. The definitions of $p_A$, $p_{A,\perp}$, and $p_{\mathcal{H}}$ can be found in Lemmas 6.21, 6.22 and 6.23, respectively.

The above lemma says that, conditioned on $k - 1$ previous successful p-PCA steps and on accurate recovery of $P_{(j-1),*}$, the probability of correctly estimating $r_{j,\text{new}}$ and of a successful $k$-th projection-PCA step is lower bounded by $p_{\text{ppca}}$. This is true whether the new directions are detected at $u_j$ or at $u_j + 1$.

C. Proof of Theorem 2.7

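Corollary 6.19. Let

$$p_{\text{det},0} := P\big(\mathrm{DET}_j^{u_j} \mid \Gamma_{j-1,\text{end}}\big).$$

From the above lemmas, we get that

$$\begin{aligned}
P(\Gamma_{j,\text{end}} \mid \Gamma_{j-1,\text{end}}) &= P\Big( \big(\mathrm{DET}_j^{u_j} \cap \mathrm{PPCA}_{j,1}^{u_j} \cap \dots \cap \mathrm{PPCA}_{j,K}^{u_j} \cap \mathrm{NODETS}_j^{u_j}\big) \cup \\
&\qquad\ \big(\overline{\mathrm{DET}_j^{u_j}} \cap \mathrm{DET}_j^{u_j+1} \cap \mathrm{PPCA}_{j,1}^{u_j+1} \cap \dots \cap \mathrm{PPCA}_{j,K}^{u_j+1} \cap \mathrm{NODETS}_j^{u_j+1}\big) \,\Big|\, \Gamma_{j-1,\text{end}} \Big) \\
&= P\big(\mathrm{DET}_j^{u_j} \cap \mathrm{PPCA}_{j,1}^{u_j} \cap \dots \cap \mathrm{PPCA}_{j,K}^{u_j} \mid \Gamma_{j-1,\text{end}}\big) \\
&\quad + P\big(\overline{\mathrm{DET}_j^{u_j}} \cap \mathrm{DET}_j^{u_j+1} \cap \mathrm{PPCA}_{j,1}^{u_j+1} \cap \dots \cap \mathrm{PPCA}_{j,K}^{u_j+1} \mid \Gamma_{j-1,\text{end}}\big) \\
&\ge p_{\text{det},0}\,(p_{\text{ppca}})^K + (1 - p_{\text{det},0})\,p_{\text{det},1}\,(p_{\text{ppca}})^K \\
&\ge p_{\text{det},0}\,p_{\text{det},1}\,(p_{\text{ppca}})^K + (1 - p_{\text{det},0})\,p_{\text{det},1}\,(p_{\text{ppca}})^K \\
&= p_{\text{det},1}\,(p_{\text{ppca}})^K.
\end{aligned}$$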

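Proof of Theorem 2.7: Theorem 2.7 follows from Corollary 6.19 and the assumed lower bound on $\alpha$. Notice that by Lemma 6.14, the choice of $K$, and Lemma 6.15, the event $\Gamma_{J,\text{end}}$ will imply all conclusions of the theorem.

By the first assumption (accurate initial subspace knowledge) and the argument used to prove Lemma 6.16, we get that $P(\Gamma_{0,\text{end}}) = 1$. By the chain rule,

$$P(\Gamma_{J,\text{end}}, \Gamma_{J-1,\text{end}}, \dots, \Gamma_{0,\text{end}}) = P(\Gamma_{0,\text{end}}) \prod_{j=1}^{J} P(\Gamma_{j,\text{end}} \mid \Gamma_{j-1,\text{end}}, \Gamma_{j-2,\text{end}}, \dots, \Gamma_{0,\text{end}}).$$

Because $\Gamma_{J,\text{end}} \subseteq \Gamma_{J-1,\text{end}} \subseteq \dots \subseteq \Gamma_{1,\text{end}} \subseteq \Gamma_{0,\text{end}}$, we get

$$P(\Gamma_{J,\text{end}}) = \prod_{j=1}^{J} P(\Gamma_{j,\text{end}} \mid \Gamma_{j-1,\text{end}}) \ge \prod_{j=1}^{J} p_{\text{det},1}\,(p_{\text{ppca}})^K = \big(p_{\text{det},1}\,(p_{\text{ppca}})^K\big)^J.$$

The last expression is at least the success probability stated in Theorem 2.7, by the lower bound on $\alpha$ assumed in the theorem and the fact that $p_{\text{det},1} > p_{\text{ppca}}$.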
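The probability bookkeeping in Corollary 6.19 and in the chain-rule step can be sketched in a few lines. The code below is only an illustration: the function names are hypothetical and the numerical probabilities are made-up inputs, since the actual values of $p_{\text{det},1}$ and $p_{\text{ppca}}$ depend on the quantities in Lemmas 6.21 to 6.23.

```python
def one_change_bound(p_det0, p_det1, p_ppca, K):
    """Intermediate bound p_det0*q + (1 - p_det0)*p_det1*q, with q = p_ppca**K."""
    q = p_ppca ** K
    return p_det0 * q + (1.0 - p_det0) * p_det1 * q


def success_lower_bound(p_det1, p_ppca, K, J):
    """Chain-rule lower bound (p_det1 * p_ppca**K)**J on P(Gamma_{J,end})."""
    return (p_det1 * p_ppca ** K) ** J


# Illustrative inputs only; K = 6 matches Section VIII, J = 2 subspace changes.
p_det1, p_ppca, K, J = 0.999, 0.9999, 6, 2
for p_det0 in (0.0, 0.3, 1.0):
    # Whatever p_det0 is, the per-change bound never drops below p_det1 * p_ppca**K.
    print(p_det0, one_change_bound(p_det0, p_det1, p_ppca, K), p_det1 * p_ppca ** K)
print(success_lower_bound(p_det1, p_ppca, K, J))
```

The point of the second inequality in the corollary is visible here: the unknown $p_{\text{det},0}$ drops out, so the final bound depends only on $p_{\text{det},1}$, $p_{\text{ppca}}$, $K$ and $J$.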


D. Key Lemmas for Proving Lemmas 6.16, 6.17 and 6.18


Before proving the lemmas from the preceding subsection, we introduce several lemmas which will be used in the proofs.

The following lemma follows from the sin0 theorem l[2^ and Weyl’s theorem. It is taken from ifT^ . 

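Lemma 6.20 ([12, Lemma 6.9]). At $u = \hat{u}_j + k$, if $\mathrm{rank}(\hat{P}_{(j),\text{new},k}) = r_{j,\text{new}}$, and if $\lambda_{\min}(A_u) - \|A_{u,\perp}\|_2 - \|\mathcal{H}_u\|_2 > 0$, then

$$\zeta_{j,\text{new},k} \le \frac{\|\mathcal{H}_u\|_2}{\lambda_{\min}(A_u) - \|A_{u,\perp}\|_2 - \|\mathcal{H}_u\|_2}. \qquad (19)$$

The next three lemmas each assert a high probability bound for one of the terms in (19). In the following lemmas, let

$$\epsilon := \frac{r_{\text{new}}\,\zeta\,\lambda_{\text{train}}^-}{100}. \qquad (20)$$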


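Lemma 6.21. Let

$$p_A := r_{\text{new}} \exp\left( \frac{-\alpha\,\zeta^2(\lambda_{\text{train}}^-)^2}{8 \cdot 100^2\,\gamma_{\text{new}}^4} \right) + r_{\text{new}} \exp\left( \frac{-\alpha\, r_{\text{new}}^2\zeta^2(\lambda_{\text{train}}^-)^2}{8 \cdot 100^2 \cdot 4^2} \right)$$

and

$$b_A := (1 - (\zeta_*^+)^2)\lambda_{\text{new}}^- - 2\epsilon.$$

For $k = 1, \dots, K$,

$$P\big(\lambda_{\min}(A_{\hat{u}_j + k}) \ge b_A \mid X_{\hat{u}_j + k - 1}\big) \ge 1 - p_A$$

for all $X_{\hat{u}_j + k - 1} \in \Gamma_{j,k-1}^{\hat{u}_j}$ with $\hat{u}_j = u_j$ or $\hat{u}_j = u_j + 1$.

The same bound holds for $\lambda_{\min}(A_{u_j + 1})$ when we condition on $X_{u_j} \in \Gamma_{j-1,\text{end}}$.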

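Lemma 6.22. Let

$$p_{A,\perp} := (n - r_{\text{new}}) \exp\left( \frac{-\alpha\, r_{\text{new}}^2(\lambda_{\text{train}}^-)^2}{8 \cdot 100^2} \right)$$

and

$$b_{A,\perp} := (\zeta_*^+)^2\lambda^+ + \epsilon.$$

For $k = 1, \dots, K$,

$$P\big(\lambda_{\max}(A_{\hat{u}_j + k,\perp}) \le b_{A,\perp} \mid X_{\hat{u}_j + k - 1}\big) \ge 1 - p_{A,\perp}$$

for all $X_{\hat{u}_j + k - 1} \in \Gamma_{j,k-1}^{\hat{u}_j}$ with $\hat{u}_j = u_j$ or $\hat{u}_j = u_j + 1$.

The same bound holds for $\lambda_{\max}(A_{u_j + 1,\perp})$ when we condition on $X_{u_j} \in \Gamma_{j-1,\text{end}}$.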


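Lemma 6.23. Let

$$\begin{aligned}
p_{\mathcal{H}} := {}& n \exp\left( \frac{-\alpha\, r_{\text{new}}^2\zeta^2(\lambda_{\text{train}}^-)^2}{32 \cdot 100^2 (\phi^+)^2 (\sqrt{\zeta} + \sqrt{r_{\text{new}}}\gamma_{\text{new}})^4} \right) + n \exp\left( \frac{-\alpha\, r_{\text{new}}^2\zeta^2(\lambda_{\text{train}}^-)^2}{8 \cdot 100^2 (\phi^+(\sqrt{\zeta} + \sqrt{r_{\text{new}}}\gamma_{\text{new}}))^4} \right) \\
& + n \exp\left( \frac{-\alpha\, r_{\text{new}}^2\zeta^2(\lambda_{\text{train}}^-)^2}{32 \cdot 100^2 (\zeta + \sqrt{\zeta}\sqrt{r_{\text{new}}}\gamma_{\text{new}})^2} \right)
\end{aligned}$$

and

$$b_{\mathcal{H},k} := 2 b_{\ell e,k} + b_{ee,k} + 2 b_F$$

where

$$b_{\ell e,k} := \begin{cases} \phi^+\big(\sqrt{\rho^2 h^+}\,(\zeta_*^+)^2\lambda^+ + \kappa_s^+\lambda_{\text{new}}^+\big) + \epsilon & k = 1 \\ \phi^+\sqrt{\rho^2 h^+}\,\big((\zeta_*^+)^2\lambda^+ + \zeta_{\text{new},k-1}^+\lambda_{\text{new}}^+\big) + \epsilon & k \ge 2 \end{cases}$$

and

$$b_{ee,k} := \begin{cases} \rho^2 h^+(\phi^+)^2\big((\zeta_*^+)^2\lambda^+ + (\kappa_s^+)^2\lambda_{\text{new}}^+\big) + \epsilon & k = 1 \\ \rho^2 h^+(\phi^+)^2\big((\zeta_*^+)^2\lambda^+ + (\zeta_{\text{new},k-1}^+)^2\lambda_{\text{new}}^+\big) + \epsilon & k \ge 2 \end{cases}$$

$$b_F := (\zeta_*^+)^2\lambda^+ + \epsilon.$$

For $k = 1, \dots, K$,

$$P\big(\|\mathcal{H}_{\hat{u}_j + k}\|_2 \le b_{\mathcal{H},k} \mid X_{\hat{u}_j + k - 1}\big) \ge 1 - p_{\mathcal{H}} \qquad (21)$$

for all $X_{\hat{u}_j + k - 1} \in \Gamma_{j,k-1}^{\hat{u}_j}$ with $\hat{u}_j = u_j$ or $\hat{u}_j = u_j + 1$.

The same bound ($k = 1$ case), i.e. $\|\mathcal{H}_{u_j + 1}\|_2 \le b_{\mathcal{H},1}$, also holds with the same probability when we condition on $X_{u_j} \in \Gamma_{j-1,\text{end}}$.


The above lemmas are proved in the next section (Section VII). The proofs use Fact 6.13.

E. Proofs of Lemmas 6.16, 6.17 and 6.18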
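Proof of Lemma 6.16: Recall that $\Gamma_{j,\text{end}} := \big(\Gamma_{j,K}^{u_j} \cap \mathrm{NODETS}_j^{u_j}\big) \cup \big(\Gamma_{j,K}^{u_j+1} \cap \mathrm{NODETS}_j^{u_j+1}\big)$.

1) By the definition of $\Gamma_{j,K}^{\hat{u}_j}$, both for $\hat{u}_j = u_j$ and $\hat{u}_j = u_j + 1$, $\zeta_{j,*} \le \zeta_{j,*}^+ = (r_0 + (j-1) r_{\text{new}})\zeta$ and $\zeta_{j,\text{new},K} \le \zeta_{j,\text{new},K}^+$. Lemma 6.14 and the choice of $K$ imply that $\zeta_{j,\text{new},K}^+ \le r_{\text{new}}\zeta$. Thus, $\zeta_{j+1,*} \le \zeta_{j,*} + \zeta_{j,\text{new},K} \le \zeta_{j,*}^+ + r_{\text{new}}\zeta = (r_0 + j r_{\text{new}})\zeta = \zeta_{j+1,*}^+$.

2) $P(\mathrm{NODETS}_j^{\hat{u}_j} \mid \Gamma_{j,K}^{\hat{u}_j}) = P\big(\lambda_{\max}(\tfrac{1}{\alpha}\mathcal{D}_u\mathcal{D}_u') < \text{thresh for all } u \in [\hat{u}_j + K + 1, u_{j+1} - 1] \mid \Gamma_{j,K}^{\hat{u}_j}\big)$ for $\hat{u}_j = u_j$ or $\hat{u}_j = u_j + 1$.

As shown in 1), $\Gamma_{j,K}^{\hat{u}_j}$ implies that $\mathrm{dif}(\hat{P}_{(j+1),*}, P_{(j+1),*}) \le \zeta_{j+1,*}^+ = (r_0 + j r_{\text{new}})\zeta$. Recall that $P_{(j+1),*} = P_{(j)}$. Also, for $u \in [\hat{u}_j + K + 1, u_{j+1} - 1]$, $\hat{P}_{u\alpha - 1,*} = \hat{P}_{(j+1),*}$. Also, for all $t \in \mathcal{J}_u$ for these $u$'s, $\ell_t = P_{(j)} a_t = P_{(j+1),*} a_{t,*}$. Therefore,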


Amax 1 PuPu 
a 


= A 


/ , (T PuQ,_i * ')itit (T Pua—l,*Pua—l,* ) 1 

« ^ / 

— -^max I E 


\2„,,2 


A 


< 4(</.+)2cA- 


train — 2 


train 


6.15 


The penultimate inequality uses the hound Q < assumed in 


r'^7^ 


The hound on comes from Lemma 
Theorem 12.71 

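The next two proofs follow using the following two facts and the four lemmas from the previous subsection.

Fact 6.24. For an event $\mathcal{E}$ and random variable $X$, $P(\mathcal{E} \mid X) \ge p$ for all $X \in \mathcal{C}$ implies that $P(\mathcal{E} \mid X \in \mathcal{C}) \ge p$.

Fact 6.25. Using the bounds on $\zeta$ and on $\rho^2 h^+$, and using the slow subspace change inequality, we get

$$b_A \ge 0.94\,\lambda_{\text{new}}^- \ge 0.94\,\lambda_{\text{train}}^-$$

$$b_{A,\perp} \le 0.011\,\lambda_{\text{train}}^-$$

$$b_{\mathcal{H},k} \le 0.24\,\lambda_{\text{train}}^-.$$

Thus, $b_A - b_{\mathcal{H},k} \ge 0.5\,\lambda_{\text{train}}^- = \text{thresh}$ and $b_{A,\perp} + b_{\mathcal{H},k} \le 0.25\,\lambda_{\text{train}}^- < \text{thresh}$.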


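Proof of Lemma 6.17: We will prove that $P(\mathrm{DET}_j^{u_j+1} \mid X_{u_j}) \ge p_{\text{det},1}$ for all $X_{u_j} \in \Gamma_{j-1,\text{end}}$. In particular, this will imply that $P(\mathrm{DET}_j^{u_j+1} \mid X_{u_j}) \ge p_{\text{det},1}$ for all $X_{u_j} \in \Gamma_{j-1,\text{end}} \cap \overline{\mathrm{DET}_j^{u_j}}$ and so we can conclude that $P(\mathrm{DET}_j^{u_j+1} \mid \Gamma_{j-1,\text{end}}, \overline{\mathrm{DET}_j^{u_j}}) \ge p_{\text{det},1}$.

Recall that $\mathcal{M}_u = \mathcal{A}_u + \mathcal{H}_u$, and observe that

$$P\big(\mathrm{DET}_j^{u_j+1} \mid X_{u_j}\big) = P\big(\lambda_{\max}(\mathcal{M}_{u_j+1}) > \text{thresh} \mid X_{u_j}\big).$$

$$\begin{aligned}
\lambda_{\max}(\mathcal{M}_{u_j+1}) &\ge \lambda_{\max}(\mathcal{A}_{u_j+1}) + \lambda_{\min}(\mathcal{H}_{u_j+1}) && \text{by Weyl's Theorem} \\
&\ge \lambda_{\max}(A_{u_j+1}) - \|\mathcal{H}_{u_j+1}\|_2 \\
&\ge \lambda_{\min}(A_{u_j+1}) - \|\mathcal{H}_{u_j+1}\|_2.
\end{aligned}$$

When $X_{u_j} \in \Gamma_{j-1,\text{end}}$, Lemmas 6.21 and 6.23 applied with $\epsilon$ given by (20) show that $\lambda_{\min}(A_{u_j+1}) \ge b_A$ and $\|\mathcal{H}_{u_j+1}\|_2 \le b_{\mathcal{H},1}$ with probability at least $1 - p_A - p_{\mathcal{H}} = p_{\text{det},1}$. Using Fact 6.25, $b_A - b_{\mathcal{H},1} > \text{thresh}$ and so the lemma follows. ■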

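Proof of Lemma 6.18: To prove this lemma we need to show two things. First, conditioned on $\Gamma_{j,k-1}^{\hat{u}_j}$, the $k$-th estimate of the number of new directions is correct. That is, $\hat{r}_{j,\text{new},k} = r_{j,\text{new}}$. Second, we must show $\zeta_{j,\text{new},k} \le \zeta_{j,\text{new},k}^+$, again conditioned on $\Gamma_{j,k-1}^{\hat{u}_j}$.

Notice that $\hat{r}_{j,\text{new},k} = \mathrm{rank}(\hat{P}_{(j),\text{new},k})$. To show that $\mathrm{rank}(\hat{P}_{(j),\text{new},k}) = r_{j,\text{new}}$, we need to show that for $u = \hat{u}_j + k$, $k = 1, \dots, K$, $\lambda_{r_{j,\text{new}}}(\mathcal{M}_u) > \text{thresh}$ and $\lambda_{r_{j,\text{new}}+1}(\mathcal{M}_u) < \text{thresh}$. To do this we proceed similarly to above.

Observe that $\mathcal{M}_u = \mathcal{A}_u + \mathcal{H}_u$. By Fact 6.25, $b_A > b_{A,\perp}$. Combining this with Lemmas 6.21 and 6.22 gives $\lambda_{\min}(A_u) > \lambda_{\max}(A_{u,\perp})$ with probability at least $1 - p_A - p_{A,\perp}$ under the appropriate conditioning (conditioned on $\Gamma_{j,k-1}^{\hat{u}_j}$). Since $A_u$ is of size $r_{j,\text{new}} \times r_{j,\text{new}}$, this means that $\lambda_{r_{j,\text{new}}}(\mathcal{A}_u) = \lambda_{\min}(A_u)$ and $\lambda_{r_{j,\text{new}}+1}(\mathcal{A}_u) = \lambda_{\max}(A_{u,\perp})$. Using this and Weyl's Theorem,

$$\lambda_{r_{j,\text{new}}}(\mathcal{M}_u) \ge \lambda_{r_{j,\text{new}}}(\mathcal{A}_u) + \lambda_{\min}(\mathcal{H}_u) \ge \lambda_{\min}(A_u) - \|\mathcal{H}_u\|_2$$

and

$$\lambda_{r_{j,\text{new}}+1}(\mathcal{M}_u) \le \lambda_{r_{j,\text{new}}+1}(\mathcal{A}_u) + \lambda_{\max}(\mathcal{H}_u) \le \lambda_{\max}(A_{u,\perp}) + \|\mathcal{H}_u\|_2$$

with probability at least $1 - p_A - p_{A,\perp}$ under the appropriate conditioning. Using Lemmas 6.21, 6.22, and 6.23 applied with $\epsilon$ given by (20) and Fact 6.25, we can conclude that with probability greater than $p_{\text{ppca}}$, $\lambda_{r_{j,\text{new}}}(\mathcal{M}_u) \ge b_A - b_{\mathcal{H},k} > \text{thresh}$ and $\lambda_{r_{j,\text{new}}+1}(\mathcal{M}_u) \le b_{A,\perp} + b_{\mathcal{H},k} < \text{thresh}$. Therefore $\mathrm{rank}(\hat{P}_{(j),\text{new},k}) = r_{j,\text{new}}$ with probability greater than $p_{\text{ppca}}$ under the appropriate conditioning.

To show that $\zeta_{j,\text{new},k} \le \zeta_{j,\text{new},k}^+$, we also use Lemmas 6.21, 6.22 and 6.23 applied with $\epsilon$ given by (20). Using $\mathrm{rank}(\hat{P}_{(j),\text{new},k}) = r_{j,\text{new}}$ and applying Lemma 6.20 with these bounds gives the desired result. ■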
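The way Weyl's Theorem separates the eigenvalues around the threshold can be illustrated with a $2 \times 2$ toy example, where eigenvalues have a closed form. Everything in the sketch below is an assumption made for illustration: the helper name, the choice $r_{\text{new}} = 1$, and the matrix entries (the diagonal values 0.94 and 0.011 and the threshold 0.5 merely echo the constants of Fact 6.25 with $\lambda_{\text{train}}^- = 1$).

```python
import math


def eig_sym2(a, b, c):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, c]]."""
    mid = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    return mid + rad, mid - rad


# Toy stand-ins, stored as (a, b, c) for [[a, b], [b, c]]:
A = (0.94, 0.0, 0.011)      # block-diagonal part: lambda_min(A_u), lambda_max(A_{u,perp})
H = (0.10, 0.12, 0.05)      # symmetric perturbation
thresh = 0.5

M = tuple(x + y for x, y in zip(A, H))
m1, m2 = eig_sym2(*M)
a1, a2 = eig_sym2(*A)
h_max, h_min = eig_sym2(*H)

# Weyl: lambda_1(M) >= lambda_1(A) + lambda_min(H), lambda_2(M) <= lambda_2(A) + lambda_max(H)
print(m1 >= a1 + h_min, m2 <= a2 + h_max)
# Estimated number of new directions: eigenvalues of M above the threshold
print(sum(1 for m in (m1, m2) if m > thresh))
```

As long as the perturbation is small relative to the gap between the two diagonal blocks, exactly $r_{\text{new}}$ eigenvalues stay above the threshold, which is the mechanism the proof uses to show the rank estimate is correct.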


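VII. Proofs of Lemmas 6.21, 6.22 and 6.23
A. Some definitions, remarks and facts

Definition 7.1. Define the following for $k = 0, 1, \dots, K$. Recall that $\hat{P}_{(j),\text{new},0} = [.]$.

1) $D_{j,\text{new},k} := (I - \hat{P}_{(j),*}\hat{P}_{(j),*}' - \hat{P}_{(j),\text{new},k}\hat{P}_{(j),\text{new},k}')P_{(j),\text{new}}$. Thus $D_{j,\text{new}} = D_{j,\text{new},0}$.

2) $D_{j,*,k} := (I - \hat{P}_{(j),*}\hat{P}_{(j),*}' - \hat{P}_{(j),\text{new},k}\hat{P}_{(j),\text{new},k}')P_{(j),*}$ and $D_{j,*} := D_{j,*,0}$.

3) Recall that $\zeta_{j,\text{new},0} = \|D_{j,\text{new}}\|_2$, $\zeta_{j,\text{new},k} = \|D_{j,\text{new},k}\|_2$, $\zeta_{j,*} = \|D_{j,*}\|_2$. Also, clearly, $\|D_{j,*,k}\|_2 \le \zeta_{j,*}$.

Definition 7.2. For ease of notation, define

$$\tilde{\ell}_t := (I - \hat{P}_{(j),*}\hat{P}_{(j),*}')\ell_t.$$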

Remark 7.3. In the rest of this section, for ease of notation, we do the following.

• We remove the subscript $j$ from $D_{j,\text{new},k}$, $D_{j,*,k}$, $\zeta_{j,\text{new},k}$ and from everything in Definitions 6.3, 6.4, 6.5, 6.9 and 7.1.

• Similarly we also let $X_k := X_{\hat{u}_j + k}$ and $\Gamma_k := \Gamma_{j,k}^{\hat{u}_j}$ for both $\hat{u}_j = u_j$ and $\hat{u}_j = u_j + 1$. More precisely, whenever we say $P(\text{event} \mid X_{k-1} \in \Gamma_{k-1}) \ge p_0$ we mean $P(\text{event} \mid X_{u_j + k - 1} \in \Gamma_{j,k-1}^{u_j}) \ge p_0$ and $P(\text{event} \mid X_{u_j + 1 + k - 1} \in \Gamma_{j,k-1}^{u_j + 1}) \ge p_0$.

• Finally, $\sum_t$ refers to the sum over $t \in \mathcal{J}_u$ with $u = \hat{u}_j + k$.

Also, note the following.

• The proof for the bound on $A_u$ for $u = u_j + 1$ is the same as that for $u = \hat{u}_j + 1$ since in both cases $\hat{P}_{t,*} = \hat{P}_{(j),*}$ and $\hat{P}_{t,\text{new}} = [.]$ for all $t \in \mathcal{J}_u$. The same is true for the bounds on $A_{u_j + 1,\perp}$ and $\mathcal{H}_{u_j + 1}$.


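Fact 7.4. When $X_{k-1} \in \Gamma_{k-1}$,

1) $\|D_{*,k-1}\|_2 \le \zeta_*^+$ for $k = 1, \dots, K$.

2) $\|D_{\text{new},k-1}\|_2 \le \zeta_{\text{new},k-1}^+$ for $k = 1, \dots, K + 1$ (by definition of $\Gamma_{k-1}$).

3) Recall that $\zeta_{\text{new},0}^+ = 1$.

4) $\phi_t \le \phi^+$ (from Lemma 6.15).

5) $\lambda_{\min}(R_{\text{new}}R_{\text{new}}') \ge 1 - (\zeta_*^+)^2$ (this follows because $\|\hat{P}_*'P_{\text{new}}\|_2 = \|\hat{P}_*'(I - P_*P_*')P_{\text{new}}\|_2 \le \zeta_*$).

6) $E_{\text{new}}'D_{\text{new}} = E_{\text{new}}'E_{\text{new}}R_{\text{new}} = R_{\text{new}}$ and $E_{\text{new},\perp}'D_{\text{new}} = 0$.

7) $\tilde{\ell}_t = D_* a_{t,*} + D_{\text{new}} a_{t,\text{new}}$.

8) $e_t$ satisfies (18) with probability one, i.e. $e_t = I_{\mathcal{T}_t}[(\Phi_t)_{\mathcal{T}_t}'(\Phi_t)_{\mathcal{T}_t}]^{-1}I_{\mathcal{T}_t}'(D_{*,k-1} a_{t,*} + D_{\text{new},k-1} a_{t,\text{new}})$.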

B. Preliminaries

First observe that the matrices $E_{\text{new}}$, $R_{\text{new}}$, $D_{\text{new}}$, $D_*$, $D_{\text{new},k-1}$ are all functions of the random variable $X_{k-1}$. Since $X_{k-1}$ is independent of any $a_t$ for $t \in \mathcal{J}_{\hat{u}_j + k}$, the same is true for these matrices. All terms that we bound for Lemmas 6.21 and 6.22 are of the form $\frac{1}{\alpha}\sum_{t \in \mathcal{J}_{\hat{u}_j + k}} Z_t$ where $Z_t = f_1(X_{k-1})\,Y_t\,f_2(X_{k-1})$, $Y_t$ is a sub-matrix of $a_t a_t'$, and $f_1(\cdot)$ and $f_2(\cdot)$ are functions of $X_{k-1}$. Thus, conditioned on $X_{k-1}$, the $Z_t$'s are mutually independent.

All the terms that we bound for Lemma 6.23 contain $e_t$. Using Lemma 6.15, conditioned on $X_{k-1}$, $e_t$ satisfies (18) with probability one whenever $X_{k-1} \in \Gamma_{k-1}$. Using (18), it is easy to see that all the terms needed for this lemma are also of the above form whenever $X_{k-1} \in \Gamma_{k-1}$. Thus, conditioned on $X_{k-1}$, the $Z_t$'s for all the above terms are mutually independent, whenever $X_{k-1} \in \Gamma_{k-1}$.

We will use the following corollaries of the matrix Hoeffding inequality from [23]. These are proved in [12].

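Corollary 7.5 (Matrix Hoeffding conditioned on another random variable for a nonzero mean Hermitian matrix). Given an $\alpha$-length sequence $\{Z_t\}$ of random Hermitian matrices of size $n \times n$, a r.v. $X$, and a set $\mathcal{C}$ of values that $X$ can take. Assume that, for all $X \in \mathcal{C}$, (i) $Z_t$'s are conditionally independent given $X$; (ii) $P(b_1 I \preceq Z_t \preceq b_2 I \mid X) = 1$ and (iii) $b_3 I \preceq \frac{1}{\alpha}\sum_t E(Z_t \mid X) \preceq b_4 I$. Then for all $\epsilon > 0$,

$$P\left( \lambda_{\max}\Big(\frac{1}{\alpha}\sum_t Z_t\Big) \le b_4 + \epsilon \,\Big|\, X \right) \ge 1 - n \exp\left( \frac{-\alpha\epsilon^2}{8(b_2 - b_1)^2} \right) \text{ for all } X \in \mathcal{C}$$

$$P\left( \lambda_{\min}\Big(\frac{1}{\alpha}\sum_t Z_t\Big) \ge b_3 - \epsilon \,\Big|\, X \right) \ge 1 - n \exp\left( \frac{-\alpha\epsilon^2}{8(b_2 - b_1)^2} \right) \text{ for all } X \in \mathcal{C}$$

Corollary 7.6 (Matrix Hoeffding conditioned on another random variable for an arbitrary nonzero mean matrix). Given an $\alpha$-length sequence $\{Z_t\}$ of random matrices of size $n_1 \times n_2$, a r.v. $X$, and a set $\mathcal{C}$ of values that $X$ can take. Assume that, for all $X \in \mathcal{C}$, (i) $Z_t$'s are conditionally independent given $X$; (ii) $P(\|Z_t\|_2 \le b_1 \mid X) = 1$ and (iii) $\|\frac{1}{\alpha}\sum_t E(Z_t \mid X)\|_2 \le b_2$. Then, for all $\epsilon > 0$,

$$P\left( \Big\|\frac{1}{\alpha}\sum_t Z_t\Big\|_2 \le b_2 + \epsilon \,\Big|\, X \right) \ge 1 - (n_1 + n_2) \exp\left( \frac{-\alpha\epsilon^2}{32\, b_1^2} \right) \text{ for all } X \in \mathcal{C}$$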
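These corollaries translate directly into a requirement on the interval length $\alpha$. The sketch below evaluates the failure probability $n \exp(-\alpha\epsilon^2 / (8(b_2 - b_1)^2))$ from the Hermitian case and inverts it to find the smallest $\alpha$ that meets a target. The function names and the numerical inputs ($n = 256$, $\epsilon = 0.1$, $b_2 - b_1 = 1$, target $10^{-3}$) are assumptions for illustration, not values used in the paper's theorem.

```python
import math


def hoeffding_failure_prob(alpha, n, eps, b1, b2):
    """n * exp(-alpha * eps^2 / (8 * (b2 - b1)^2)): failure probability in the Hermitian case."""
    return n * math.exp(-alpha * eps ** 2 / (8.0 * (b2 - b1) ** 2))


def min_alpha(n, eps, b1, b2, delta):
    """Smallest integer alpha for which the failure probability is at most delta."""
    return math.ceil(8.0 * (b2 - b1) ** 2 * math.log(n / delta) / eps ** 2)


n, eps, b1, b2, delta = 256, 0.1, 0.0, 1.0, 1e-3
a = min_alpha(n, eps, b1, b2, delta)
print(a, hoeffding_failure_prob(a, n, eps, b1, b2))
```

The required $\alpha$ grows only logarithmically in $n$ and in $1/\delta$ but quadratically in $(b_2 - b_1)/\epsilon$, which is why the lemmas in this section spend effort on tight almost-sure bounds for each $Z_t$.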
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^ and < A^g^ < A^g-^,^ < ^^train 


C. Simple Lemmas Needed for the Proofs 

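Lemma 7.7. For $j = 1, \dots, J$ and $k = 1, \dots, K$, for all $X_{\hat{u}_j + k - 1} \in \Gamma_{j,k-1}^{\hat{u}_j}$,

1) $0 \preceq E\big[\frac{1}{\alpha}\sum_{t \in \mathcal{J}_{\hat{u}_j + k}} a_{t,*}a_{t,*}' \mid X_{\hat{u}_j + k - 1}\big] = \frac{1}{\alpha}\sum_{t \in \mathcal{J}_{\hat{u}_j + k}} \Lambda_{t,*} \preceq \lambda^+ I$

2) $\lambda_{\text{new}}^- I \preceq E\big[\frac{1}{\alpha}\sum_{t \in \mathcal{J}_{\hat{u}_j + k}} a_{t,\text{new}}a_{t,\text{new}}' \mid X_{\hat{u}_j + k - 1}\big] = \frac{1}{\alpha}\sum_{t \in \mathcal{J}_{\hat{u}_j + k}} \Lambda_{t,\text{new}} \preceq \lambda_{\text{new}}^+ I$

3) $E\big[\frac{1}{\alpha}\sum_{t \in \mathcal{J}_{\hat{u}_j + k}} a_{t,*}a_{t,\text{new}}' \mid X_{\hat{u}_j + k - 1}\big] = 0$

with $\hat{u}_j = u_j$ or $\hat{u}_j = u_j + 1$.

The same bounds also hold for summation over $t \in \mathcal{J}_{u_j + 1}$ when we condition on $X_{u_j} \in \Gamma_{j-1,\text{end}}$.

Proof: The proof follows from Model 2.2 and Fact 6.13. The only reason we need $X_{\hat{u}_j + k - 1} \in \Gamma_{j,k-1}^{\hat{u}_j}$ is to apply Fact 6.13, which allows us to lower and upper bound the eigenvalues of $\Lambda_{t,\text{new}}$ by $\lambda_{\text{new}}^-$ and $\lambda_{\text{new}}^+$ and then use the slow subspace change inequality. ■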

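Lemma 7.8. Assume that the assumptions of Theorem 2.7 hold. Recall that $D_{\text{new}} = D_{\text{new},0}$. Conditioned on $X_{k-1} \in \Gamma_{k-1}$,

$$\|I_{\mathcal{T}}'D_{\text{new}}\|_2 \le \kappa_s^+ := 0.0215 \qquad (22)$$

for all $\mathcal{T}$ such that $|\mathcal{T}| \le s$.

The proof is in Appendix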


D. Proofs of Lemma 6.21 and 6.22 


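Proof of Lemma 6.21: We obtain the bounds on $A_u$ for $u = \hat{u}_j + k$ for $k = 1, 2, \dots, K$ and $\hat{u}_j = u_j$ or $u_j + 1$. For $u = \hat{u}_j + k$, recall that $A_u := \frac{1}{\alpha}\sum_t E_{\text{new}}'\tilde{\ell}_t\tilde{\ell}_t'E_{\text{new}}$.

Notice that $E_{\text{new}}'\tilde{\ell}_t = R_{\text{new}}a_{t,\text{new}} + E_{\text{new}}'D_*a_{t,*}$. Let $Z_t = R_{\text{new}}a_{t,\text{new}}a_{t,\text{new}}'R_{\text{new}}'$, and let $Y_t = R_{\text{new}}a_{t,\text{new}}a_{t,*}'D_*'E_{\text{new}} + E_{\text{new}}'D_*a_{t,*}a_{t,\text{new}}'R_{\text{new}}'$, then

$$A_u \succeq \frac{1}{\alpha}\sum_t Z_t + \frac{1}{\alpha}\sum_t Y_t. \qquad (23)$$

Consider $\frac{1}{\alpha}\sum_t Z_t$. (1) The $Z_t$'s are conditionally independent given $X_{k-1}$. (2) With probability 1, $\|Z_t\|_2 \le r_{\text{new}}\gamma_{\text{new}}^2$. (3) Using a theorem of Ostrowski [·, Theorem 4.5.9], conditioned on $X_{k-1} \in \Gamma_{k-1}$,

$$\lambda_{\min}\Big(E\Big(\frac{1}{\alpha}\sum_t Z_t \mid X_{k-1}\Big)\Big) = \lambda_{\min}\Big(R_{\text{new}}\Big(\frac{1}{\alpha}\sum_t \Lambda_{t,\text{new}}\Big)R_{\text{new}}'\Big) \ge \lambda_{\min}(R_{\text{new}}R_{\text{new}}')\,\lambda_{\min}\Big(\frac{1}{\alpha}\sum_t \Lambda_{t,\text{new}}\Big) \ge (1 - (\zeta_*^+)^2)\lambda_{\text{new}}^-.$$

The last inequality uses Lemma 7.7 and Fact 7.4.

Thus, applying Corollary 7.5 with $\epsilon$ given by (20), we get that, for all $X_{k-1} \in \Gamma_{k-1}$,

$$P\left( \lambda_{\min}\Big(\frac{1}{\alpha}\sum_t Z_t\Big) \ge (1 - (\zeta_*^+)^2)\lambda_{\text{new}}^- - \epsilon \,\Big|\, X_{k-1} \right) \ge 1 - r_{\text{new}} \exp\left( \frac{-\alpha\,\zeta^2(\lambda_{\text{train}}^-)^2}{8 \cdot 100^2\,\gamma_{\text{new}}^4} \right). \qquad (24)$$

Consider $Y_t = R_{\text{new}}a_{t,\text{new}}a_{t,*}'D_*'E_{\text{new}} + E_{\text{new}}'D_*a_{t,*}a_{t,\text{new}}'R_{\text{new}}'$. (1) The $Y_t$'s are conditionally independent given $X_{k-1}$. (2) Using the bound on $\zeta$ from the theorem, $\|Y_t\|_2 \le 2\sqrt{r_{\text{new}}r}\,\zeta_*^+\gamma\gamma_{\text{new}} \le 2\sqrt{r_{\text{new}}r}\,\zeta_*^+\gamma^2 \le 2$ holds with probability one for all $X_{k-1} \in \Gamma_{k-1}$. Thus, under the same conditioning, $-2I \preceq Y_t \preceq 2I$ with probability one. (3) By Lemma 7.7, $E(\frac{1}{\alpha}\sum_t Y_t \mid X_{k-1}) = 0$ for all $X_{k-1} \in \Gamma_{k-1}$.

Thus, applying Corollary 7.5 with $\epsilon$ given by (20), we get that, for all $X_{k-1} \in \Gamma_{k-1}$,

$$P\left( \lambda_{\min}\Big(\frac{1}{\alpha}\sum_t Y_t\Big) \ge -\epsilon \,\Big|\, X_{k-1} \right) \ge 1 - r_{\text{new}} \exp\left( \frac{-\alpha\, r_{\text{new}}^2\zeta^2(\lambda_{\text{train}}^-)^2}{8 \cdot 100^2 \cdot 4^2} \right). \qquad (25)$$

Combining (23), (24) and (25) and using the union bound, we get the lemma. ■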


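Proof of Lemma 6.22: Remark 7.3 applies.

We obtain the bounds on $A_{u,\perp}$ for $u = \hat{u}_j + k$ for $k = 1, 2, \dots, K$ with $\hat{u}_j = u_j$ or $u_j + 1$. For all these $u$'s, recall that $A_{u,\perp} := \frac{1}{\alpha}\sum_t E_{\text{new},\perp}'\tilde{\ell}_t\tilde{\ell}_t'E_{\text{new},\perp}$. Using $E_{\text{new},\perp}'D_{\text{new}} = 0$, we get that $E_{\text{new},\perp}'\tilde{\ell}_t = E_{\text{new},\perp}'D_*a_{t,*}$. Thus, $A_{u,\perp} = \frac{1}{\alpha}\sum_t Z_t$ with $Z_t = E_{\text{new},\perp}'D_*a_{t,*}a_{t,*}'D_*'E_{\text{new},\perp}$.

Using the same ideas as for the previous proof we can show that $0 \preceq Z_t \preceq r(\zeta_*^+)^2\gamma^2 I \preceq \zeta I$ and $E(\frac{1}{\alpha}\sum_t Z_t \mid X_{k-1}) \preceq (\zeta_*^+)^2\lambda^+ I$. Thus by Corollary 7.5 the lemma follows. ■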


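E. Proof of Lemma 6.23


Proof of Lemma 6.23: Remark 7.3 applies. Using the expression for $\mathcal{H}_u$ given in Definition 6.9 and noting that for a basis matrix $E$, $EE' + E_\perp E_\perp' = I$, we get that

$$\mathcal{H}_u = \frac{1}{\alpha}\sum_{t \in \mathcal{J}_u} \Big( (I - \hat{P}_*\hat{P}_*')e_te_t'(I - \hat{P}_*\hat{P}_*') - \big(\tilde{\ell}_te_t'(I - \hat{P}_*\hat{P}_*') + (I - \hat{P}_*\hat{P}_*')e_t\tilde{\ell}_t'\big) + (F_t + F_t') \Big)$$

where

$$F_t = E_{\text{new},\perp}E_{\text{new},\perp}'\,\tilde{\ell}_t\tilde{\ell}_t'\,E_{\text{new}}E_{\text{new}}'.$$

Thus,

$$\|\mathcal{H}_u\|_2 \le 2\Big\|\frac{1}{\alpha}\sum_t \tilde{\ell}_te_t'\Big\|_2 + \Big\|\frac{1}{\alpha}\sum_t e_te_t'\Big\|_2 + 2\Big\|\frac{1}{\alpha}\sum_t F_t\Big\|_2. \qquad (26)$$

Next we obtain high probability bounds on each of the three terms on the right hand side of (26).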

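Consider $\|\frac{1}{\alpha}\sum_t \tilde{\ell}_te_t'\|_2$. Using Lemma 6.15, $e_t$ satisfies (18) with probability one for all $X_{k-1} \in \Gamma_{k-1}$.

Let $Z_t := \tilde{\ell}_te_t'$. (1) Conditioned on $X_{k-1}$, the various $Z_t$'s used in the summation are mutually independent, for all $X_{k-1} \in \Gamma_{k-1}$. (2) For $X_{k-1} \in \Gamma_{k-1}$,

$$\|Z_t\|_2 = \|\tilde{\ell}_te_t'\|_2 \le \big(\zeta_*^+\sqrt{r}\gamma + \sqrt{r_{\text{new}}}\gamma_{\text{new}}\big)\,\phi^+\big(\zeta_*^+\sqrt{r}\gamma + \zeta_{\text{new},k-1}^+\sqrt{r_{\text{new}}}\gamma_{\text{new}}\big) := b_3$$

holds with probability one. (3) First consider the $k \ge 2$ case. When $X_{k-1} \in \Gamma_{k-1}$,

$$\begin{aligned}
\Big\| E\Big[\frac{1}{\alpha}\sum_t \tilde{\ell}_te_t' \,\Big|\, X_{k-1}\Big] \Big\|_2 &= \Big\| \frac{1}{\alpha}\sum_t \big(D_*\Lambda_{t,*}D_{*,k-1}' + D_{\text{new}}\Lambda_{t,\text{new}}D_{\text{new},k-1}'\big) I_{\mathcal{T}_t}[(\Phi_t)_{\mathcal{T}_t}'(\Phi_t)_{\mathcal{T}_t}]^{-1}I_{\mathcal{T}_t}' \Big\|_2 \\
&\le \sqrt{ \Big\| \frac{1}{\alpha}\sum_t \big(D_*\Lambda_{t,*}D_{*,k-1}' + D_{\text{new}}\Lambda_{t,\text{new}}D_{\text{new},k-1}'\big)\big(D_*\Lambda_{t,*}D_{*,k-1}' + D_{\text{new}}\Lambda_{t,\text{new}}D_{\text{new},k-1}'\big)' \Big\|_2 \, \Big\| \frac{1}{\alpha}\sum_t I_{\mathcal{T}_t}[(\Phi_t)_{\mathcal{T}_t}'(\Phi_t)_{\mathcal{T}_t}]^{-2}I_{\mathcal{T}_t}' \Big\|_2 } \\
&\le \big((\zeta_*^+)^2\lambda^+ + \zeta_{\text{new},k-1}^+\lambda_{\text{new}}^+\big)\sqrt{\rho^2 h^+}\,\phi^+.
\end{aligned}$$

The first inequality is by Cauchy-Schwarz for a sum of matrices. This can be found as Lemma D.2 in Appendix D. The second inequality uses Fact 7.4 (for the first term of the product) and Lemma 5.3 with $\sigma^+ = (\phi^+)^2$ (for the second term of the product).

Now consider the $k = 1$ case. To bound $\|\frac{1}{\alpha}\sum_t D_*\Lambda_{t,*}D_*'I_{\mathcal{T}_t}[(\Phi_t)_{\mathcal{T}_t}'(\Phi_t)_{\mathcal{T}_t}]^{-1}I_{\mathcal{T}_t}'\|_2$ we proceed exactly as we did for the $k \ge 2$ case. We can bound this by $(\zeta_*^+)^2\lambda^+\sqrt{\rho^2 h^+}\,\phi^+$. To bound $\|\frac{1}{\alpha}\sum_t D_{\text{new}}\Lambda_{t,\text{new}}D_{\text{new},0}'I_{\mathcal{T}_t}[(\Phi_t)_{\mathcal{T}_t}'(\Phi_t)_{\mathcal{T}_t}]^{-1}I_{\mathcal{T}_t}'\|_2$, we apply Lemma 7.8 to get $\|D_{\text{new},0}'I_{\mathcal{T}_t}\|_2 \le \kappa_s^+$. Using this and Fact 7.4 we can bound this by $\kappa_s^+\lambda_{\text{new}}^+\phi^+$. Thus, when $X_0 \in \Gamma_0$,

$$\Big\| E\Big[\frac{1}{\alpha}\sum_t \tilde{\ell}_te_t' \,\Big|\, X_0\Big] \Big\|_2 \le \big(\sqrt{\rho^2 h^+}\,(\zeta_*^+)^2\lambda^+ + \kappa_s^+\lambda_{\text{new}}^+\big)\phi^+.$$

Thus, by Corollary 7.6 with $\epsilon$ given by (20), we get that, for all $X_{k-1} \in \Gamma_{k-1}$,

$$P\left( \Big\|\frac{1}{\alpha}\sum_t \tilde{\ell}_te_t'\Big\|_2 \le b_{\ell e,k} \,\Big|\, X_{k-1} \right) \ge 1 - n \exp\left( \frac{-\alpha\, r_{\text{new}}^2\zeta^2(\lambda_{\text{train}}^-)^2}{32 \cdot 100^2\, b_3^2} \right). \qquad (27)$$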

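Consider $\|\frac{1}{\alpha}\sum_t e_te_t'\|_2$. Let $Z_t = e_te_t'$. (1) Conditioned on $X_{k-1}$, the various $Z_t$'s in the summation are independent, for all $X_{k-1} \in \Gamma_{k-1}$. (2) Using Lemma 6.15, conditioned on $X_{k-1} \in \Gamma_{k-1}$,

$$0 \preceq Z_t \preceq \big(\phi^+(\zeta_*^+\sqrt{r}\gamma + \zeta_{\text{new},k-1}^+\sqrt{r_{\text{new}}}\gamma_{\text{new}})\big)^2 I := b_4 I$$

with probability one. (3) By Fact 7.4, when $X_{k-1} \in \Gamma_{k-1}$,

$$E\Big[\frac{1}{\alpha}\sum_t e_te_t' \,\Big|\, X_{k-1}\Big] = \frac{1}{\alpha}\sum_t I_{\mathcal{T}_t}[(\Phi_t)_{\mathcal{T}_t}'(\Phi_t)_{\mathcal{T}_t}]^{-1}I_{\mathcal{T}_t}'\big(D_{*,k-1}\Lambda_{t,*}D_{*,k-1}' + D_{\text{new},k-1}\Lambda_{t,\text{new}}D_{\text{new},k-1}'\big)I_{\mathcal{T}_t}[(\Phi_t)_{\mathcal{T}_t}'(\Phi_t)_{\mathcal{T}_t}]^{-1}I_{\mathcal{T}_t}'.$$

When $k = 1$ we can apply Lemma 7.8 to get that $\|D_{\text{new},0}'I_{\mathcal{T}_t}\|_2 \le \kappa_s^+$. Then we apply Lemma 5.3 with $\sigma^+ = (\phi^+)^2\big((\zeta_*^+)^2\lambda^+ + (\kappa_s^+)^2\lambda_{\text{new}}^+\big)$. This gives

$$0 \preceq E\Big[\frac{1}{\alpha}\sum_t e_te_t' \,\Big|\, X_0\Big] \preceq \rho^2 h^+(\phi^+)^2\big((\zeta_*^+)^2\lambda^+ + (\kappa_s^+)^2\lambda_{\text{new}}^+\big) I \quad \text{for all } X_0 \in \Gamma_0.$$

When $k \ge 2$ we can apply Lemma 5.3 with $\sigma^+ = (\phi^+)^2\big((\zeta_*^+)^2\lambda^+ + (\zeta_{\text{new},k-1}^+)^2\lambda_{\text{new}}^+\big)$ to get that

$$0 \preceq E\Big[\frac{1}{\alpha}\sum_t e_te_t' \,\Big|\, X_{k-1}\Big] \preceq \rho^2 h^+(\phi^+)^2\big((\zeta_*^+)^2\lambda^+ + (\zeta_{\text{new},k-1}^+)^2\lambda_{\text{new}}^+\big) I \quad \text{for all } X_{k-1} \in \Gamma_{k-1}.$$

Thus, applying Corollary 7.5 with $\epsilon$ given by (20), we get that, for all $X_{k-1} \in \Gamma_{k-1}$,

$$P\left( \Big\|\frac{1}{\alpha}\sum_t e_te_t'\Big\|_2 \le b_{ee,k} \,\Big|\, X_{k-1} \right) \ge 1 - n \exp\left( \frac{-\alpha\, r_{\text{new}}^2\zeta^2(\lambda_{\text{train}}^-)^2}{8 \cdot 100^2\, b_4^2} \right). \qquad (28)$$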

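Finally, consider $\|\frac{1}{\alpha}\sum_t F_t\|_2$. Since $E_{\text{new},\perp}'D_{\text{new}} = 0$,

$$F_t = E_{\text{new},\perp}E_{\text{new},\perp}'\,\tilde{\ell}_t\tilde{\ell}_t'\,E_{\text{new}}E_{\text{new}}' = E_{\text{new},\perp}E_{\text{new},\perp}'(D_*a_{t,*})(D_*a_{t,*} + D_{\text{new}}a_{t,\text{new}})'E_{\text{new}}E_{\text{new}}'.$$

(1) Conditioned on $X_{k-1}$, the $F_t$'s are mutually independent, for all $X_{k-1} \in \Gamma_{k-1}$. (2) For $X_{k-1} \in \Gamma_{k-1}$,

$$\|F_t\|_2 \le r(\zeta_*^+)^2\gamma^2 + \zeta_*^+\sqrt{r\, r_{\text{new}}}\,\gamma\gamma_{\text{new}} := b_5$$

holds with probability 1. (3) For $X_{k-1} \in \Gamma_{k-1}$,

$$\Big\| \frac{1}{\alpha}\sum_t E(F_t \mid X_{k-1}) \Big\|_2 \le \Big\| \frac{1}{\alpha}\sum_t D_*\Lambda_{t,*}D_*' \Big\|_2 \le (\zeta_*^+)^2\lambda^+.$$

Applying Corollary 7.6 with $\epsilon$ given by (20), we get that, for all $X_{k-1} \in \Gamma_{k-1}$,

$$P\left( \Big\|\frac{1}{\alpha}\sum_t F_t\Big\|_2 \le b_F \,\Big|\, X_{k-1} \right) \ge 1 - n \exp\left( \frac{-\alpha\, r_{\text{new}}^2\zeta^2(\lambda_{\text{train}}^-)^2}{32 \cdot 100^2\, b_5^2} \right). \qquad (29)$$

Combining (26) with (27), (28) and (29) and using the union bound, we get the lemma. The expression for $p_{\mathcal{H}}$ given in the lemma uses the bounds on $\zeta$ from the theorem and uses the loose bound $\zeta_{\text{new},k-1}^+ \le 1$ (to get a simpler expression for the probabilities). ■

(Footnote to the $k = 1$ case of the first term: notice that if we want to use the bound of Lemma 7.8, we cannot also apply Lemma 5.3 for this term. We can get a simpler proof by not using Lemma 7.8 at all and proceeding exactly as we did for the $k \ge 2$ case; but doing this will require a much tighter bound on $\rho^2 h^+$ than what we currently need.)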


VIII. Simulation Experiments 

In this section we provide some simulations that demonstrate the robust PCA result we have proven above. More detailed simulations using real data can be found in [11].

The data for Figure 4 was generated as follows. We chose $n = 256$ and $t_{\max} = 15{,}000$. Each measurement had $s = 20$ missing or corrupted entries, i.e. $|T_t| = 20$. Each non-zero entry of $x_t$ was drawn uniformly at random between 2 and 6, independent of other entries and other times $t$. In Figure 4 the support of $x_t$ changes as assumed in Model 2.3 with $\rho = 2$ and $\beta = 18$. So the support of $x_t$ changes by $s/\rho = 10$ indices every 18 time instants. When the support of $x_t$ reaches the bottom of the vector, it starts over again at the top. This pattern can be seen in the bottom half of the figure, which shows the sparsity pattern of the matrix $S = [x_1, \dots, x_{t_{\max}}]$.

To form the low dimensional vectors $\ell_t$, we started with an $n \times r$ matrix of i.i.d. Gaussian entries and orthonormalized the columns using Gram-Schmidt. The first $r_0 = 10$ columns of this matrix formed $P_{(0)}$, the next 2 columns formed $P_{(1),\mathrm{new}}$, and the last 2 columns formed $P_{(2),\mathrm{new}}$. We show two subspace changes, which occur at $t_1 = 600$ and $t_2 = 8000$. The entries of $a_{t,*}$ were drawn uniformly at random between $-5$ and $5$, and the entries of $a_{t,\mathrm{new}}$ were drawn uniformly at random between $-\sqrt{3 v_i^{t - t_j} \lambda^-_{\mathrm{train}}}$ and $\sqrt{3 v_i^{t - t_j} \lambda^-_{\mathrm{train}}}$ with $v_i = 1.00017$ and $\lambda^-_{\mathrm{train}} = 1$ (and $q_i = 1$). Thus $(\Lambda_{t,\mathrm{new}})_{i,i} = v_i^{t - t_j} q_i \lambda^-_{\mathrm{train}}$, as assumed in Model 2.2. Entries of $a_t$ were independent of each other and of the other $a_t$'s.
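As a rough illustration of the support pattern used in this experiment, the following Python sketch generates $T_t$ for $n = 256$, $s = 20$, $\rho = 2$ and $\beta = 18$. The function name `support_at`, the 0-based indexing, and the exact restart rule (the block restarts at the top as soon as it would run past the bottom) are assumptions made for illustration; the text only states that the support starts over at the top.

```python
def support_at(t, n=256, s=20, rho=2, beta=18):
    """Return the support T_t (0-based indices) at time t = 1, 2, ...

    Hypothetical helper: the support is a block of s consecutive indices
    that moves down by s/rho indices every beta time instants.
    """
    shift = s // rho                            # 10 indices per change here
    positions_per_pass = (n - s) // shift + 1   # block positions before restarting (assumed rule)
    num_changes = (t - 1) // beta               # changes that occurred before time t
    top = (num_changes % positions_per_pass) * shift
    return list(range(top, top + s))


# The support is constant for beta = 18 instants, then moves by 10 indices.
print(support_at(1)[0], support_at(18)[0], support_at(19)[0])
```

After $\rho = 2$ changes the new support no longer overlaps the old one, which is the kind of change that Model 2.3 asks for.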

For this simulated data we compare the performance of ReProCS and PCP. The plots show the relative error in recovering $\ell_t$, that is $\|\hat{\ell}_t - \ell_t\|_2 / \|\ell_t\|_2$. For the initial subspace estimate $\hat{P}_0$, we used $P_0$ plus some small Gaussian noise and then obtained orthonormal columns. We set $\alpha = 800$ and $K = 6$. For the PCP algorithm, we perform the optimization every $\alpha$ time instants using all of the data up to that point. So the first time PCP is performed on $[m_1, \dots, m_\alpha]$, the second time it is performed on $[m_1, \dots, m_{2\alpha}]$, and so on.

Figure 4 illustrates the result we have proven. That is, ReProCS takes advantage of the initial subspace estimate and slow subspace change (including the bound on $\gamma_{\mathrm{new}}$) to handle the case when the supports of $x_t$ are correlated in time. Notice how the ReProCS error increases after a subspace change, but decays exponentially with each projection PCA step. For this data, the PCP program fails to give a meaningful estimate for all but a few times. The average time taken by the ReProCS algorithm was 52 seconds, while PCP averaged over 5 minutes. Simulations were coded in MATLAB® and run on a desktop computer with a 3.2 GHz processor.


IX. Extensions 

In this section, we first give other models on changes in $T_t$ that are special cases of the general model, Model 5.1, and hence can also be used in Theorem 2.5 or 2.7. The next three subsections discuss various other results that can also be proved using the proof techniques developed in this work.

Fig. 4: Comparison of ReProCS and PCP for the RPCA problem. The top plot is the relative error $\|\hat{\ell}_t - \ell_t\|_2 / \|\ell_t\|_2$. The bottom plot shows the sparsity pattern of S (black represents a non-zero entry). Results are averaged over 100 simulations and plotted every 300 time instants.


A. Other Models on Changes in $T_t$


We give here other models on changes in $T_t$ that are special cases of Model 5.1.


Model 9.1. Suppose that $T_t$ consists of consecutive indices and is of size $s$ or less, i.e. $|T_t| \le s$. When $T_t$ is not empty, let $d_t$ denote its smallest (topmost) index. Let $\rho$ be an integer. We assume that $d_t$ satisfies the following Bernoulli-Gaussian model:

$$d_t = \lceil o_t \bmod n \rceil \quad \text{where} \quad o_t = o_{t-1} + \theta_t \left( 1.1 \frac{s}{\rho} + w_t \right)$$

where $w_t \sim \mathcal{N}(0, \sigma^2)$ (Gaussian) and $\theta_t \sim \mathrm{Bernoulli}(q)$. Assume that $\{w_t\}$, $\{\theta_t\}$ are mutually independent and independent of the $\ell_t$'s. Taking the mod with respect to $n$ describes the process of the set $T_t$ starting over at 1 when its topmost index exceeds $n$ (this models a new object appearing after the old one has disappeared; notice that at any $t$, $T_t$ could be empty as well, i.e. there may be no object).

Assume that s < q > 1 — —-)^ for a (3 that satisfies < 0.01, and < 4 ooop 2 log(n) • 
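The Bernoulli-Gaussian motion of Model 9.1 can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions: the function name `simulate_topmost_index`, the starting position `o = 1.0`, and the parameter values in the usage line are hypothetical and are not taken from the paper.

```python
import math
import random


def simulate_topmost_index(t_max, n, s, rho, q, sigma, seed=0):
    """Simulate d_t = ceil(o_t mod n), with o_t = o_{t-1} + theta_t * (1.1*s/rho + w_t)."""
    rng = random.Random(seed)
    o = 1.0                                    # assumed initial position (not given in the text)
    d = []
    for _ in range(t_max):
        theta = 1 if rng.random() < q else 0   # theta_t ~ Bernoulli(q): does the object move?
        w = rng.gauss(0.0, sigma)              # w_t ~ N(0, sigma^2): jitter in the step size
        o += theta * (1.1 * s / rho + w)
        d.append(math.ceil(o % n))             # wrap around once the bottom is passed
    return d


# Hypothetical parameter values, chosen only to exercise the sketch.
d = simulate_topmost_index(t_max=200, n=256, s=20, rho=2, q=0.5, sigma=0.01)
print(d[:10])
```

With a small `sigma`, every move is close to $1.1 s/\rho$ indices, which is why the proof of Lemma 9.4 can bound each move between $s/\rho$ and $1.2 s/\rho$ with high probability.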

Model 9.2. Suppose that Tt consists of s consecutive indices and suppose that it moves down the vector by between 
1 and m indices at every time t. When it reaches the bottom of the vector, we assume that it starts over at 1. 
Assume that s < 0.0025a and m < 

— — a 


Model 9.3. In both models above we let $T_t$ contain consecutive indices. This models a moving 1D object of length $s$ or less that enters the scene and eventually walks out, and then another object of length $s$ or less may come in.

However, notice that nothing in our general model, Model 5.1, requires the indices to be consecutive or contiguous in any way. Thus in both of Models 9.1 and 9.2 above, instead of one moving object, we can also have multiple moving objects as long as the union of their supports is of size at most $s$ and satisfies one of these models. Also, with minor changes, the object(s), instead of leaving the scene, can reflect back up and start moving in the other direction as well.

Fig. 5: Model 9.2


Lemma 9.4. If tmax < then Model \9.l \ is a special case of Model \2.3\ (and hence a special case of Model 
5.1) with probability at least 1 — n~^^. 


Proof: The proof has three steps, (a) We first use standard arguments about a Bernoulli sequence Il25ll to prove 
that the object moves at least once every (5 time instants with probability at least 1 — The choice of q 

ensures that this holds, (b) Next we use a standard Gaussian tail bound argument to show that, with probability at 
least 1 — 0.5re“^°, when it moves, it moves by at least s/p indices and at most l.2s/p indices. The bound on 
ensures this, (c) The above two claims ensure that, w.h.p., the object remains static for at most /3 frames at a time 
and when it moves it moves by at least s/p indices and at most l.2s/p indices. Notice that all the motion is in one 
direction. Motion by at least s/p in one direction ensures that after the object moves p times, i.e. after p changes of 
Tt, the sets are disjoint, i.e. n = 0. Motion by at most 1.2s/p in one direction and l-2^a < n ensures 

the third condition of Model [23] holds even when the object moves at every frame. ■ 


Lemma 9.5. Model 9.2 is a special case of Model 5.1 with $\rho = 2$ and $h^+ = s/\alpha$.


See Figure 5 for a diagram of the model and the idea behind its proof.

Proof: 

For the sake of clarity, we will prove the case when the object moves exactly 1 index at every time t. The only 
difference in the general case is the construction of the 

Consider an interval Ju. Let tu '■= {u — l)a + 1 denote the first time in Ju. Without loss of generality (because 
we can re-label the indices) let the object start at the top of the vector. That is 7t„ = [1,5]. Let lu = Let 
T(i),u = [(* - 1)'S + for f = 1,2,..., |_^J. If ^ is not an integer, also define 7(|--]),„ = [LlJ ® 

Define J(i)^u ■= [tu + {i — l)s, tu + is — 1] for f = 1, 2,..., |_|J . If | is not an integer, also define ) u ~ 

[tu + |_f J s,tu + a — 1]. 
Clearly the $J_{(i),u}$ defined above are a partition of $J_u$. Also, by construction, for all $t \in J_{(i),u}$, $T_t \subseteq T_{(i),u} \cup T_{(i+1),u}$.

This follows from three facts: 1) the assumption that $T_{t_u} = [1, s]$ (which is just a renumbering of the indices to make the numbers clearer); 2) the object moves down by exactly one index at each time $t$; and 3) the bound on $m$ assumed in Model 9.2, so that once an index leaves $T_t$, it will not return in the next $\alpha$ time instants. A simpler way of stating fact 3) is that the total motion is such that $T_t$ does not return to where it started, i.e. $T_{t_u} \cap T_{t_u + \alpha} = \emptyset$.
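The containment claim can be checked numerically for the one-index-per-time case. The sketch below is an illustration under assumptions that are not confirmed by the (partly illegible) construction in the source: the disjoint sets are taken to be the blocks $[(i-1)s + 1, is]$, the time intervals have length $s$ (the last one possibly shorter), and $n$ is assumed large enough that the object does not wrap around within the interval. The function name is hypothetical.

```python
def check_model_9_2_construction(s, alpha):
    """Check T_t is contained in block(i) U block(i+1) for every t in the i-th sub-interval.

    Returns (containment_holds, longest_sub_interval_length).
    """
    def block(i):
        # Assumed disjoint sets: block(i) = [(i-1)*s + 1, i*s], 1-based indices.
        return set(range((i - 1) * s + 1, i * s + 1))

    num_intervals = alpha // s + (1 if alpha % s else 0)
    longest = 0
    for i in range(1, num_intervals + 1):
        start = (i - 1) * s               # first time offset t - t_u in the i-th sub-interval
        stop = min(i * s, alpha)          # exclusive end; the last sub-interval may be shorter
        longest = max(longest, stop - start)
        for offset in range(start, stop):
            # The object starts at [1, s] and moves down one index per time instant.
            support = set(range(1 + offset, s + offset + 1))
            if not support <= (block(i) | block(i + 1)):
                return False, longest
    return True, longest


print(check_model_9_2_construction(s=20, alpha=800))
```

In this sketch the support always lies inside two consecutive blocks and no sub-interval is longer than $s$, which matches the values $\rho = 2$ and $h^+ = s/\alpha$ stated in Lemma 9.5.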

Notice that ^ s for all i. (With the possible exception of the last set, they all have size exactly s.) So under 

the assumptions of Model 9.2 /i^(a) < s, which satisfies Model |5.l| with h~^ = ^, < 0.0025a = ° ° 


2 = 


B. Analyze the ReProCS algorithm that also removes the deleted directions from the subspace estimate 

The tools introduced in this paper - (a) Lemma 5.3 and the way it is applied to bound PLu in Lemma 6.23t 


and (b) the detection lemma (Lemma 6.17), the no false detection lemma (Lemma 6.16) and the p-PCA lemma (Lemma 6.18) - can also be used to get a correctness result for a practical modification of ReProCS with cluster-PCA (ReProCS-cPCA), which is Algorithm 2 of [12]. This algorithm was introduced to also remove the deleted directions from the subspace estimate. It does this by re-estimating the previous subspace at a time after the newly added subspace has been accurately estimated (i.e. at a time after $\hat{t}_j + K\alpha$). A partial result for this algorithm was proved in [12].

This result will need one extra assumption - it will need the eigenvalues of the covariance matrix of $\ell_t$ to be clustered for a period of time after the subspace change has stabilized, i.e. for a period of $d_2$ frames in the interval $[t_j + d + 1, t_{j+1} - 1]$ - but it will have a key advantage. It will need a much weaker denseness assumption and hence a much weaker bound on $r$ or $r_{\mathrm{mat}}$. In particular, with this result we expect to be able to allow $r = r_{\mathrm{mat}} \in O(n)$ with the same assumptions on $s$ and $s_{\mathrm{mat}}$ that we currently allow. This requirement is almost as weak as that of PCP.


C. Relax the independence assumption on the $\ell_t$'s

The results in this work assume that the $\ell_t$'s are independent over time and zero mean; this is a valid model when background images have independent random variations about a fixed mean. Using the tools developed in this paper, a similar result can also be obtained for the more general case of $\ell_t$'s following an autoregressive model. This will allow the $\ell_t$'s to be correlated over time. A partial result for this case was obtained in [?]. The main change in this case will be that we will need to apply the matrix Azuma inequality from [23] instead of matrix Hoeffding. This will also require algebraic manipulation of sums and some other important modifications, as explained in [?], so that the constant term after conditioning on past values of the matrix is small.

D. Noisy and Undersampled Online Matrix Completion or Online Robust PCA 

We expect that the tools introduced in this paper can also be used to analyze the noisy case, i.e. the case of $m_t = x_t + \ell_t + w_t$ where $w_t$ is small bounded noise. In most practical video applications, while the foreground is truly sparse, the background is only approximately low-rank. The modeling error can be handled as $w_t$. The proposed algorithms already apply without modification to this case (see [11] for results on real videos). The reason that our tools will directly extend to the noisy case is this: the sparse recovery step is already a noisy sparse recovery one, so its analysis will not change if we also add in more noise due to $w_t$. If $\ell_t$ and $w_t$ are assumed independent, then only a few simple modifications to the analysis of the p-PCA step should be needed as well.

Finally, we expect both the algorithm and the proof techniques to apply with simple changes to the undersampled
case $m_t = A_t x_t + B_t \ell_t + w_t$ as long as $B_t$ is not time-varying, i.e. $B_t = B_0$. A partial result for this case was
obtained in ll2^ and experiments were shown in lUTll . 
X. Conclusions


In this work, we obtained correctness results for online robust PCA and for online matrix completion. Both results needed four key assumptions: (a) accurate initial subspace knowledge; (b) slow subspace change and mutual independence of the $\ell_t$'s according to Model 2.2; (c) some changes in the set of missing entries (or in the set of outlier-corrupted entries) over time, one way to quantify what is needed is given in Model 2.3; (d) a denseness assumption on the columns of the subspace basis matrices of $\ell_t$; and (e) algorithm parameters are appropriately set.

Ongoing work includes obtaining the results mentioned in Sections IX-B, IX-C and IX-D. Besides these, we expect the proof techniques developed here to apply to various other problems involving PCA with data and noise terms being correlated.


Appendix A

Proof of Lemma 5.2

Proof that Model 2.3 on $T_t$ satisfies the general Model 5.1

Consider an interval Ju- We will construct one set of mutually disjoints sets 


that are subsets of {l,2,...n} and a partition {J7(i),n}i=i,2,...i„ of so that for all t G J(i)^w 
( [T0| holds and so that hu{a;{T{^i)^u} < P for this choice. Since h*^{a) takes the minimum over all such 


sets, this will imply K^{a) < /?. By setting = fi/a and using the Model 2.3 assumption < 0.01a, we will 
be done. 

that Tt = for all t E with < 13 and |r[^l| < s. 


Recall from Model 


2.3 


Let til •— — l)a “t” 1 denote the first time index of fJu. Let kn be the index k for which tn E In 

other words, Define lu to be the number of intervals that have non-empty intersection with 

J'u- So lu is one plus the number of times Tt changes in the interval Ju- For f = 1,2, .../„ — 1, define 

and set T(i^)^u = Clearly lu < a. Thus, by the Model 2.3 assumption (for any k and i such that 

k < i < k + a, \ n (T^^^ \ = 0), the mutually disjoint. 

Next, define a parfifion of Ju as 

J{Tu ■= n Ju for i = 1,2,.. ./„ 

By Model 2.3 1 < tk+i — tk < P for all k. Since J(i)^u ^ < /3 for alH = 1, 2,... lu- 

Nofice fhaf for all t E J{i)^u^ T = So if we can show fhaf c U7(j+i)^„ • • •U7(i+p_i)_„ 

for alH = 1, 2,... lu, we will be done since fhis will imply h*^{Q) < (3. To show fhis, set k = ku + i — 1- Then, 

rW = T(i),u U n 

= T^i),u u n \ u n n 
c ^),u u 7j,+i),„ U [rt'i n n 

= T^i),u U T^i+i),u U n n \ 7-[fc+3]) u p -j-ik+i] p j-[k+ 2 ] p ^[fc+3]] 

c u u 7j,+2) U [rt'l n n n 

Continuing in the same manner as above, we get, 

= T(i),u U 7(i+i),„ U • • • U 7(i+p_i),„ (30) 


The last line is because T^^^ H = 0 by Model 2.3 
Appendix B


Proof of Lemma 


6.14 


(bound on fc) and of Lemma 


7.1 


Proof of Lemma 6.14: This proof's approach is similar to that of [12, Lemma 6.1]. The details have some differences because our main result now uses different assumptions.


This lemma uses Model 5.1. As shown in Lemma 5.2, Model 2.3 is a special case of this general model.

Recall,ha, 

Recall that e = O.OlrnewC^train- Divide the numerator and denominator hy A. ■ . Define 


with the terms on the RHS defined in Lemmas 


6.21 


6.22 


6.23 


Bk ■= 


+ 2Ktct>+ 

P'^+(</>+)2Cynew,fc-l + 2x/p^</>+ 


■^trai 




k = 1 
k>2 


Ck-.= 


P^h+{f+)\C^Jr + 2^/^f+iC+Jr + 2(C+)r 






A+ 


train ^ 


+ 0.05 


Then, 


Dk : = 1 - (Ct)" - (C+) 


^ A+ 

k "^train . 


^ynew,fc-l-®fc ?’newC(C'fc + -02) 


/"+ ^ /-+ I f 

’j,new,fc — +,new,A:—1 ' ^newS • 


Recall fhat nf = 0.0215 and = 1.2. If is nof difficulf fo see fhaf k increasing funcfion of , r, 
(, Ct^^> ^nd jS^ and C/new fc-i- Consider k = 1. Using C/new o ~ ^ upper hounds assumed in Theorem 

I ^train '’'train 

2.7 on fhe above quanfifies, we gef fhaf C+new,l < 0.18. 

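Thus, $\zeta^+_{j,\mathrm{new},1} \le \zeta^+_{j,\mathrm{new},0} = 1$. Using this and the fact that $\zeta^+_{j,\mathrm{new},k}$ is an increasing function of $\zeta^+_{j,\mathrm{new},k-1}$, we can show by induction that $\zeta^+_{j,\mathrm{new},k} \le \zeta^+_{j,\mathrm{new},k-1}$. Thus, $\zeta^+_{j,\mathrm{new},k} \le \zeta^+_{j,\mathrm{new},1} \le 0.18$ for all $k = 1, 2, \dots, K$.

Using $\zeta^+_{j,\mathrm{new},k} \le 0.18$ and the bounds assumed in Theorem 2.7 on the other quantities, we get that

$$\zeta^+_{j,\mathrm{new},k} \le 0.83\, \zeta^+_{j,\mathrm{new},k-1} + 0.14\, r_{\mathrm{new}} \zeta.$$

Using this, we get

$$\begin{aligned}
\zeta^+_{j,\mathrm{new},k} &\le 0.83\, \zeta^+_{j,\mathrm{new},k-1} + 0.14\, r_{\mathrm{new}} \zeta \\
&\le \zeta^+_{j,\mathrm{new},0} (0.83)^k + \sum_{i=0}^{k-1} (0.83)^i (0.14)\, r_{\mathrm{new}} \zeta \\
&\le \zeta^+_{j,\mathrm{new},0} (0.83)^k + \sum_{i=0}^{\infty} (0.83)^i (0.14)\, r_{\mathrm{new}} \zeta \\
&\le 0.83^k + 0.84\, r_{\mathrm{new}} \zeta.
\end{aligned}$$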
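The last step of this chain rests on the geometric series $\sum_{i \ge 0} 0.83^i = 1/0.17$, so that $0.14/0.17 \approx 0.824 \le 0.84$. The short Python sketch below iterates the recursion with equality and compares it against the closed-form bound. The function name and the numerical value chosen for $r_{\mathrm{new}} \zeta$ are hypothetical and serve only as a sanity check.

```python
def zeta_upper_bounds(K, c, zeta0=1.0):
    """Iterate b_k = 0.83 * b_{k-1} + 0.14 * c starting from b_0 = zeta0.

    Here c stands for r_new * zeta; the returned list is [b_0, ..., b_K].
    """
    bounds = [zeta0]
    for _ in range(K):
        bounds.append(0.83 * bounds[-1] + 0.14 * c)
    return bounds


c = 0.01  # hypothetical value of r_new * zeta, for illustration only
for k, b in enumerate(zeta_upper_bounds(K=6, c=c)):
    closed_form = 0.83 ** k + 0.84 * c
    print(k, round(b, 5), round(closed_form, 5), b <= closed_form)
```

Each iterate stays below $0.83^k + 0.84\, r_{\mathrm{new}} \zeta$, which is the exponential decay of the subspace error with each projection PCA step.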


Proof of Lemma 7.8


Recall that Dj^^ew = {I - PQ),*PQ),*')P(j),„ew- Then ||/r P. 


j',new||2 — 


11-^7” (-^ )-^(7),new II 2 ^ ||-^7” -f*(_ 7 ),new ||2 H“ ll■^(7),* -^(j),new||2 ^ (-P(_;),new) 

^{j),*P(j),*')P{j),new\\2 < i^s{P(j),new) + Cj,*- The event Xi,^+k-i e implies that Cj,* < < 0.0015. 

Thus, the lemma follows. ■ 
Appendix C

Proof of the Compressed Sensing (CS) Lemma (Lemma 6.15)

This proof's approach is similar to that of [12, Lemma 6.4]. The details have some differences because our
main result now uses different assumptions. The proof uses the denseness assumption and subspace error bounds
Cj,* < and Ci,new,fc-1 < Cj^new,fc-1’ ^hcn G for uj = uj or Uj = Uj + 1, to obtain 

bounds on the restricted isometry constant (RIC) of the sparse recovery matrix and the sparse recovery error 
Applying the noisy compressed sensing (CS) result from 1191 and the assumed bounds on ^ and 7, the 
lemma follows. 

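Lemma C.1. [12, Lemma 2.10] Suppose that $P$, $\hat{P}$ and $Q$ are three basis matrices. Also, $P$ and $\hat{P}$ are of the same size, $Q'P = 0$ and $\|(I - \hat{P}\hat{P}')P\|_2 = \zeta_*$. Then,

1) $\|(I - \hat{P}\hat{P}')PP'\|_2 = \|(I - PP')\hat{P}\hat{P}'\|_2 = \|(I - PP')\hat{P}\|_2 = \|(I - \hat{P}\hat{P}')P\|_2 = \zeta_*$

2) $\|PP' - \hat{P}\hat{P}'\|_2 \le 2\|(I - \hat{P}\hat{P}')P\|_2 = 2\zeta_*$

3) $\|\hat{P}'Q\|_2 \le \zeta_*$

4) $\sqrt{1 - \zeta_*^2} \le \sigma_i((I - \hat{P}\hat{P}')Q) \le 1$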


We begin by first bounding the RIC of the CS matrix $\Phi_t$. We will use the notation $\kappa_s^2(P)$ to mean $(\kappa_s(P))^2$.

Lemma C.2 (Bounding the RIC of ifT^ Lemma 6.6]). Recall that := ||(/ — ^,')P(j) * II 2 . The 

following hold. 

1) Suppose that a basis matrix P can be split as P = [Pi P 2 ] where Pi and P 2 are also basis matrices. Then 

kI{P) = max7-.|7-|<^ II V^lli < i^liPi) + kI{P2)- 

2) K^siPij),*) < + 2C* for all j 

3) ris{P(^j'j^ne-w,k) — ^s,new T C;,new,fc T Cj,* far all j and k. 

4) For t G [{Uj-i + K)a + 1, {uj + l)a), Ssi^t) = i^liPu),*) < + 2Ci>. 

5) For k = 1,... ,K -1, for t G [{uj + k)a + 1, {uj + k + l)a] 6s{^t) = Kii[P{j),* Pu),new,k]) < i4{P(j),*) + 

^siP{j),new,k) — T “^Cj,* T (^s,new T Cj,new,k T Cj,*) ■ 

Proof: 

1) Recall that kI{P) = max|r|<. HV-PlIi- Also, HV^’lli = llVi^i P 2 ][Pi P 2 ]'Irh = WiPiPi + 
P 2 P 2 )It \\2 < ||.?^r^-pL-f*i^-^Tll 2 + ||.?^r^-P 2 .P 2 ^.?^rl| 2 - Thus, the inequality follows. 

2) For any set T with |r| < s, ||/r'P(j),*||i = ||/r'-Po'),*-Po'),*'i'r II2 = \\It {P(j),*P(j),*' - P{j),*P{j),* + 

P{j),*P{j),*)^T \\2 < \\Tt' {P(j),*P(j) *' - P(j),*P{j),*)3^T\\2 + \\Tt' P{j),*P{j),*' Tt \\2 < 2Cj> + The 

last inequality follows using Lemma C.l with P = P(j),* and P = Pq),*. 

with 


3) By Lemma C.l with P = P(j),*, P = P{j),* and Q = P(j), 


I-P(i),new'-P(i),*ll2 < Cj,*- By Lemma 


C.l 


P P(j),new and P P(j),new,k’ II (-^ P(j),newP(j),new )-^(i),new,fc II2 II (-^ P{j),new,kP(j),new,k H^" 

For any set P with |P| ^ S, ||.fT-^(j),new,fcl|2 ^ ll-^T {T P{j),newP(j),new )P(j),new,k\\2 T 

\\Tp P(^j),new P(j),new P{j),new ,k\\2 ^ ll(-^ P(j),new P(j),new )-^(j),new,fc II 2 T ||.^T-f*(i),iiewl|2 ll(-^ 

P{j),new,kP{j),new,k )) i^ew 112 T ||.^T -f*(i),newl|2 ^ II-^(j),new,fc II2 T II-^(j),*-^(j),* ll-^T 112- 

Taking max over |T| < s the claim follows. 

4) This follows using Lemma |2.9| and the second claim of this lemma. 


5) This follows using Lemma 2.9 and the first three claims of this lemma. 


Corollary C.3. 











41 


1} Conditioned on rj_i end> for t G [tj, {uj + l)a], < 0-1 < 0.1479, and 

||[($i)r/(^i)rJ-i2 < < 1.2 := f+. 

2) For k = 2,..., K and Uj = Uj or Uj = Uj + 1, conditioned on for t G [{uj + k — l)a + 1, {uj + k)a\, 

Ssm < S2s{^t) < {k2s,*? + 2C+ + (^2., new + C+new,fc-l + C/j" < 0.1479, and || [(^t)r/(^t)rj-'lb < 
<1.2:= 6+. 


1-54^0 


3) For Uj = Uj or Uj = Uj + 1, conditioned on for t G [{uj + K)a + l,ij+i — 1], < (52s(^t) < 

{tt 2 s,*? + 2C+ < 0.1 < 0.1479, and ||[(^i)r/(^brj“i 2 < < 1-2 := b+. 


Proof: This follows using Lemma 
Lemma 16.141 


C.2 


the definitions of r,_i end and r“i, and the hound on C 

J ! J^K 


^j,new,/c—1 


from 


The following are straightforward hounds that will he useful for the proof of Lemma 6.15 


Fact C.4. Under the assumptions of Theorem |2. 7 
VC 


Ct < 


<^/C 


V,* ' - ^r-o + (J-l)c 

Cjtnew,fc-i ^ 0.83'^“^ + 0.84rnewC (from Lemma 

C/newfc-lTnew < 0.83*'“^7new + 0.84rnewC7new < 0.83^“^7new + 0.3VC 


6.14) 


Proof of Lemma 6.15- We will prove claim 2). The others are done in the same way. 

Recall that implies that Cj,* < Cy* and Cj,new,k-i < Cynew,fc-i- 

a) For t G [{uj + k-l)a + l, (uj + k)a], bt := (I - Pt-iPt-i)£t = Dj^^^k-iat,* + Dj^aew,k-iat,new Thus, using 
FactlCH 


l^tlb < Cj>\/^7 + C: ?, ne w ,k—l \/^newTnew 

< (0.83^-Vnew + 0.84yC)Vw 

= V^new0.83^“^7new + s/CipT + 0.84y7wb) < 


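b) By Corollary C.3, $\delta_{2s}(\Phi_t) \le 0.15 < \sqrt{2} - 1$. Given $|T_t| \le s$ and $\|b_t\|_2 \le \xi$, by the theorem in [19], the CS error satisfies

$$\|\hat{x}_{t,cs} - x_t\|_2 \le \frac{4\sqrt{1 + \delta_{2s}(\Phi_t)}}{1 - (\sqrt{2} + 1)\delta_{2s}(\Phi_t)}\, \xi \le 7\xi.$$

c) Using the above, $\|\hat{x}_{t,cs} - x_t\|_\infty \le 7\xi$. Since $\min_{i \in T_t} |(x_t)_i| \ge x_{\min}$ and $(x_t)_{T_t^c} = 0$, $\min_{i \in T_t} |(\hat{x}_{t,cs})_i| \ge x_{\min} - 7\xi$ and $\max_{i \in T_t^c} |(\hat{x}_{t,cs})_i| \le 7\xi$. If $\omega < x_{\min} - 7\xi$, then $\hat{T}_t \supseteq T_t$. On the other hand, if $\omega > 7\xi$, then $\hat{T}_t \subseteq T_t$. Since $\omega$ satisfies $7\xi \le \omega \le x_{\min} - 7\xi$, the support of $x_t$ is exactly recovered, i.e. $\hat{T}_t = T_t$.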
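Two numerical facts are used in steps b) and c): the constant in the CS error bound is below 7 when $\delta_{2s} \le 0.15$, and thresholding recovers the support when every entry of the CS estimate is within $7\xi$ of the truth. The Python sketch below illustrates both. The function names, the thresholding rule $\hat{T}_t = \{i : |(\hat{x}_{t,cs})_i| > \omega\}$, and the toy vectors are assumptions introduced for illustration.

```python
import math


def cs_error_constant(delta_2s):
    """Constant multiplying xi in the noisy CS bound: 4*sqrt(1+d) / (1 - (sqrt(2)+1)*d)."""
    return 4 * math.sqrt(1 + delta_2s) / (1 - (math.sqrt(2) + 1) * delta_2s)


def threshold_support(x_hat, omega):
    """Assumed support estimate: indices whose magnitude exceeds the threshold omega."""
    return {i for i, v in enumerate(x_hat) if abs(v) > omega}


print(cs_error_constant(0.15))            # stays below 7

# Toy example: xi = 0.01, so each entry of the estimate is within 7*xi = 0.07 of the truth.
xi = 0.01
x_true = [0.0, 3.0, 0.0, -2.5, 0.0]       # x_min = 2.5, true support {1, 3}
x_hat = [0.05, 2.96, -0.07, -2.44, 0.02]
omega = 0.5                                # satisfies 7*xi <= omega <= x_min - 7*xi
print(threshold_support(x_hat, omega))
```

Any `omega` between $7\xi$ and $x_{\min} - 7\xi$ gives the same recovered support in this example, which mirrors the argument in step c).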

d) Given Tt = Tt, the least squares estimate of Xt satisfies {xt)Tt = [i^t)Tt]^yt = [{^t)Tt]H^tXt + ^tP) and 

{xt)r^ = 0. Also, = Irf^t (this follows since = ^tlpt and Using this, 

the error := Xt — Xt satisfies (18). Thus, using Fact El and the hounds on ||at||oo and ||at^new||oo! for 
t G [{uj + k — l)a + 1, (uj + k)a\. 




b ^ b~*~ (Cy* T C^ne w ,A:— iV^newTnew) ^ 1-2 ^1.06-\/£ + (0.83) \/^new7new^ 


The last inequality follows from Lemma 6.14 
Appendix D

Proof of Cauchy-Schwarz inequality for matrices 


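Lemma D.1 (Cauchy-Schwarz for a sum of vectors). For vectors $x_t$ and $y_t$,

$$\left( \sum_{t=1}^{\alpha} x_t{}' y_t \right)^2 \le \left( \sum_{t=1}^{\alpha} \|x_t\|_2^2 \right) \left( \sum_{t=1}^{\alpha} \|y_t\|_2^2 \right)$$

Proof:

$$\left( \sum_{t=1}^{\alpha} x_t{}' y_t \right)^2 = \left( [x_1{}', \dots, x_\alpha{}'] \begin{bmatrix} y_1 \\ \vdots \\ y_\alpha \end{bmatrix} \right)^2 \le \left\| \begin{bmatrix} x_1 \\ \vdots \\ x_\alpha \end{bmatrix} \right\|_2^2 \left\| \begin{bmatrix} y_1 \\ \vdots \\ y_\alpha \end{bmatrix} \right\|_2^2 = \left( \sum_{t=1}^{\alpha} \|x_t\|_2^2 \right) \left( \sum_{t=1}^{\alpha} \|y_t\|_2^2 \right)$$

The inequality is by Cauchy-Schwarz for a single vector. ■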
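A quick numerical check of the inequality for a sum of vectors, written as a Python sketch with no external dependencies. The helper names and the random test vectors are hypothetical; the check only confirms that the right-hand side minus the left-hand side is non-negative on sample data.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def cauchy_schwarz_gap(xs, ys):
    """Return (sum ||x_t||^2)(sum ||y_t||^2) - (sum x_t' y_t)^2, which should be >= 0."""
    lhs = sum(dot(x, y) for x, y in zip(xs, ys)) ** 2
    rhs = sum(dot(x, x) for x in xs) * sum(dot(y, y) for y in ys)
    return rhs - lhs


rng = random.Random(0)
xs = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(8)]
ys = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(8)]
print(cauchy_schwarz_gap(xs, ys) >= 0)    # inequality holds on this sample
print(abs(cauchy_schwarz_gap(xs, xs)))    # near zero: equality when y_t = x_t
```

Equality holds when the stacked vectors are parallel, which is why the gap essentially vanishes for `ys = xs`.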

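Lemma D.2 (Cauchy-Schwarz for a sum of matrices). For matrices $X_t$ and $Y_t$,

$$\left\| \frac{1}{\alpha} \sum_{t=1}^{\alpha} X_t Y_t{}' \right\|_2^2 \le \lambda_{\max}\left( \frac{1}{\alpha} \sum_{t=1}^{\alpha} X_t X_t{}' \right) \lambda_{\max}\left( \frac{1}{\alpha} \sum_{t=1}^{\alpha} Y_t Y_t{}' \right)$$

Proof of Lemma D.2:

$$\begin{aligned}
\left\| \sum_{t=1}^{\alpha} X_t Y_t{}' \right\|_2^2 &= \max_{\|x\|=1,\, \|y\|=1} \left| x' \left( \sum_{t=1}^{\alpha} X_t Y_t{}' \right) y \right|^2 \\
&= \max_{\|x\|=1,\, \|y\|=1} \left| \sum_{t=1}^{\alpha} (X_t{}' x)' (Y_t{}' y) \right|^2 \\
&\le \max_{\|x\|=1,\, \|y\|=1} \left( \sum_{t=1}^{\alpha} \|X_t{}' x\|_2^2 \right) \left( \sum_{t=1}^{\alpha} \|Y_t{}' y\|_2^2 \right) \\
&= \max_{\|x\|=1} x' \left( \sum_{t=1}^{\alpha} X_t X_t{}' \right) x \cdot \max_{\|y\|=1} y' \left( \sum_{t=1}^{\alpha} Y_t Y_t{}' \right) y \\
&= \lambda_{\max}\left( \sum_{t=1}^{\alpha} X_t X_t{}' \right) \lambda_{\max}\left( \sum_{t=1}^{\alpha} Y_t Y_t{}' \right)
\end{aligned}$$

The inequality is by Lemma D.1. The penultimate line is because $\|x\|_2^2 = x'x$. Multiplying both sides by $1/\alpha^2$ gives the desired result. ■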